Glad to see SQLite on the front page -- it's one of the silent workhorses of the modern programming/database world. It is the most widely deployed database[0]. If you haven't given it a look, or aren't interested because it seems like a "toy" database (often the "test" or "local" db for frameworks like Rails or Django), you owe it to yourself to see what it can really do. Easy to use full text search[1][2], CTEs[3], JSON support via extension[4] (the extension system itself is also worth looking at[5]) and much much more. There are certainly things that SQLite does not do, and that's well documented too[6].
If all this doesn't convince you to use SQLite, it's also one of the most well documented large C codebases that is fantastic to learn from.
I'd go as far as to say that many modern startups could get pretty far with SQLite + aggressive caching before even bringing in a big database like Postgres (though with recent deployment/ops advancements it's easier than ever to run postgres).
I only learned this week that it supports recursive CTEs (although apparently you can't refer to them in subqueries yet?) and only yesterday that you can do this:
>>> import sqlite3
>>> con = sqlite3.connect(':memory:')
>>> cur = con.cursor()
>>> q = '''select x, sum(a * power(x, e))
... from (select -2 as x union select -1 union select 0 union select 1 union select 2) xt,
... (select 1 as a, 2 as e union select 4, 1 union select 2, 0) et
... group by x'''
>>> cur.execute(q).fetchall()
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
sqlite3.OperationalError: no such function: power
>>> con.create_function('power', 2, lambda base, power: base**power)
>>> cur.execute(q).fetchall()
[(-2, -2), (-1, -1), (0, 2), (1, 7), (2, 14)]
That is freaking cool. Not evaluating polynomials in SQL, I mean, but defining a user-defined function for SQL in one line of Python.
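(On the recursive CTE bit: a minimal sketch of the kind of thing that works, assuming an SQLite built after 3.8.3, which added CTE support.)

    import sqlite3

    con = sqlite3.connect(':memory:')
    # classic recursive CTE: generate the integers 1..5
    rows = con.execute('''
        with recursive cnt(x) as (
            select 1
            union all
            select x + 1 from cnt where x < 5
        )
        select x from cnt
    ''').fetchall()
    print(rows)  # [(1,), (2,), (3,), (4,), (5,)]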
It simply uses the sqlite3_create_function API, so it’s as fast/slow as your Python function. You can even use Cython if pure Python isn’t fast enough. I use UDFs quite a bit and the performance is fine without extra tweaks most of the time, but for something like pow I’d use the math extension instead which would be way more performant.
Btw, user-defined scalar functions aren’t remotely as sexy as user-defined window functions.
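For anyone curious, a minimal sketch of a user-defined window function. This assumes Python 3.11+ (which added Connection.create_window_function) on top of an SQLite with window-function support (3.25+); the names are made up for illustration.

    import sqlite3

    class WindowSum:
        """Running sum usable as a window function."""
        def __init__(self):
            self.total = 0
        def step(self, value):     # a row enters the window frame
            self.total += value
        def inverse(self, value):  # a row leaves the window frame
            self.total -= value
        def value(self):           # current result for the current row
            return self.total
        def finalize(self):        # result when used as a plain aggregate
            return self.total

    con = sqlite3.connect(':memory:')
    con.create_window_function('winsum', 1, WindowSum)
    con.execute('create table t(x int)')
    con.executemany('insert into t values (?)', [(1,), (2,), (3,), (4,)])
    print(con.execute(
        'select x, winsum(x) over (order by x rows 2 preceding) from t'
    ).fetchall())
    # [(1, 1), (2, 3), (3, 6), (4, 9)]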
Beware that only recent versions of Python have a sqlite3 recent enough to support window functions. This bit me recently; I had to scrap a prototype, pull all the data out, and use numpy instead. Still, in general, SQLite3 is amazing and it's my automatic first choice when working with 100GiB or less.
Python links to sqlite3 dynamically so you just need to sub in a more recent sqlite3 (>=3.25.0). I don’t think it has anything to do with the Python version.
> If you haven't given it a look/aren't interested in it since it seems to be a "toy" database
As you mentioned, it ships in every major browser, most OSes, and a number of extremely popular instances of consumer software. If that's still a "toy" I don't know what isn't…
We used SQLite for a while until we were limited by IO on our crappy SD cards for our embedded product and wanted a database server with more tuning options. Before that point, it was a very capable database, so I wouldn't hesitate to use it for pretty much any product that doesn't need high-performance cross-thread access (basically any CRUD app) or the ability to run on a cluster/dedicated server (e.g. high storage requirements).
It's fast, simple, and easy. Perhaps my favorite feature is that backups are just a copy (in fact, when I'm debugging an app, I often copy the database into a temp directory and go to town).
I totally agree that SQLite is a great first option for startups. If your database will fit on your filesystem (gigabytes, not terabytes), it'll probably do really well. It's really easy to switch to something like Postgres later, and it's really nice to not have to worry about the complexities of a database server until you actually need them (e.g. you're considering hiring dedicated IT/DBA staff instead of just developers). And there's a good chance you'll never need anything other than SQLite.
Indeed, I also believe that SQLite is underused in a lot of situations.
That's largely why I created RediSQL[1]: to enable more widespread use of it in "modern"/"cloud native" applications where you just want to connect to a database over the network, ideally through a common API like the Redis one.
It has the great benefit that if you create your database in memory it is crazy fast, and it retains the persistence properties of Redis, which are good enough in most cases.
Checkfront, a booking management system, got by on SQLite up until maybe 2 years ago - it was the solution for 5 or more years! These days it runs on Postgres, but it went very far without it.
> I'd go as far as to say that many modern startups could get pretty far with SQLite + aggressive caching before even bringing in a big database like Postgres
Modern startups need their services to be available, and this is difficult to achieve with SQLite because you can't run long database migrations without blocking writes during the migration. Personally, I'd use SQLite more often if there was a good solution to this (and also DROP COLUMN).
My statement was definitely a reach, but I think there is a bunch of space in the life of an early startup for lack of highly available services. I've found that startups prioritize time-to-market (or time to PoC) over technical purity -- there are usually lots of unplanned outages early on that have nothing to do with the database, or that are the result of bad application code/misunderstood requirements, so I'm not fully buying the availability emphasis. But here are a few things that I think could counteract it:
- I'm not sure I agree that the delay of the blocking write from a migration is so long as to cause significant (if any) damage to the business.
- If SQLite makes development and deployment easier (I think it does), this becomes a tradeoff between a deployment downtime/partial operation window likely on the order of milliseconds and hours of time spent setting up/managing the alternatives. With really robust systems like Postgres out there this is a pretty weak argument on my part, but there is at least something there.
- As I've also mentioned elsewhere, a high read-to-write workload would be less affected by this problem. If your startup produces aggregated business analytics dashboards for manufacturing plant stakeholders for example, your data set could be completely static.
- WAL mode[0] improves SQLite's resiliency to concurrent readers/writers.
- In the extreme case, you could treat SQLite as a hybrid row & JSON document store, and manage your own schemas (I do not endorse this approach), but you'd still be saving some time because you can do stuff like make use of SQLite's FTS search features, and you'd have access to other relational features that you might not get with other document stores.
All this said, SQLite is not for lots of concurrent writers, and they say so themselves[1], but I think it can absolutely be done. More to the point, I think that SQLite can help you get something out and easily deployable, and is not hard to move off of later.
> My statement was definitely a reach, but I think there is a bunch of space in the life of an early startup for lack of highly available services.
I think there's room for a lot fewer 9s for most businesses generally, provided you have enough isolation that one thing being down doesn't also take down everything else. The exceptions are mainly ones that are trying to be someone else's infrastructure.
For lots of businesses, especially early on, maintenance windows are entirely fine and perfect bigco rolling deploys to giant HA clusters and such are way overkill that introduce all kinds of costs and complexity (=risk) that could be avoided by just accepting that every now and then some part of your site/service will be down. Just have backups, know how to restore from them quickly, and make sure you have scripted from-scratch deployments that actually work (having developers build them daily or weekly for their own local work does a pretty decent job of sussing out brokenness) and you're probably fine.
[EDIT] but of course that's the opposite of Résumé Driven Development and is very Not Cool and likely to be unpopular with clueless managers.
manager: "Why's our site down?"
you: "a deployment broke, it's fine, we can restore from scratch in ten minutes flat if we have to."
versus, also you: "as you know we follow Google's best practices and use Kubernetes and blah blah blah and you see [translated from Bullshit Speak] we don't actually understand it very well and it's super complex and it shit the bed for some reason but we're fixing it, and as you know this is all best practices and Google like and such as"
Unfortunately the latter is often "safer" than the former... for one's career, not for the product or service you're providing.
I can see what you mean, but most startups try to have users as early as possible, to demonstrate traction. Most also practice continuous deployment. In the early stage, a lot of refactoring usually happens, which often requires database migrations. If we are afraid to refactor our database because our service may be unavailable and disappoint our early adopters, then we lose velocity and peace of mind. A tool like PostgreSQL solves this, with a small admin overhead compared to SQLite.
And the admin overhead can be reduced by using a managed service (Amazon RDS, Google Cloud SQL, Digital Ocean Managed Databases, etc.). With a managed service, we can even go serverless for the rest of the app. With SQLite, we have to closely manage and back up our server since the data are stored on it.
> I'm not sure I agree that the delay of the blocking write from a migration is so long as to cause significant (if any) damage to the business.
I agree that most migrations, in the early stage of a project, run for a few seconds to a few minutes maximum. But in my experience that's usually enough to be noticeable by our users.
> If SQLite makes development and deployment easier (I think it does)
I'm not sure exactly what the use case is but another awesome tool that's recently been making the rounds is datasette[0], it provides an easy to use data exploration point for SQLite databases. There's also a PyCon 2019 talk[1].
Hardly, that is more of a hack. There is however a distributed, clustered version of SQLite: Bloomberg's Comdb2. It has the same scaling factors as e.g. Percona, can be geo-replicated (multi-roomed) and has many advanced timing and financial data types. Plus SQL functions can be written in Lua. Oh, and it has super granular serialization guarantees, up to and including full linearizability.
By using SQLite as the SQL frontend, it retains features other NewSQL engines struggle with (recursive CTEs, deep optimization, compatibility with existing queries, etc.)
Also there is dqlite[0]. I really do think that SQLite can go even farther than people think, but the exploration makes less and less sense the better and easier to run options like postgres become. That said, I'm glad projects like dqlite and rqlite exist.
There are also previous threads discussing them but here's one[1].
> I really do think that SQLite can go even farther than people think
I know different databases support different features (e.g., atomic transactions, ACID compliance, etc.), but it seems like the big differentiator is size, namely:
1. Can your database live within a single computer's RAM?
2. Can your database live within a single computer?
3. Can your database live across a cluster of computers in a single data center?
4. Can your database live across a cluster of computers spread across the planet?
Once you pick your category from the above, there are still choices to be made, but the "best" options narrow down significantly.
I'd argue that feature set (transactions, ACID, etc.) is a bigger differentiator because it drives/changes application development -- this can have drastic effects on application features and complexity, whether immediately or down the road.
I agree that size is possibly the second-biggest (if not the biggest) differentiator, though -- despite the number of choices out there, once you start answering those questions, many options fall away very quickly.
I'd also add that the only real way to scale a database is to shard. There are a bunch of tricks you can do to improve throughput and those are always welcome, and sometimes I'll see articles that proclaim "we just changed to a system with 256GB of RAM to fix our scaling problem!" -- but these improvements don't get past the basic premise that once your data does not fit on one computer/hard drive, you need to start figuring out a way for multiple machines to coordinate. You need to shard, you need to organize your data in such a way that it can shard, and then you need to start worrying about data access patterns. There's a lot of space before something like this is necessary (you really do have to have a lot of data for it to not fit on one machine these days), but I think people don't see this chain coming fast/often enough.
> I'd argue that feature set (transactions, ACID, etc.) is a bigger differentiator because it drives/changes application development
Handling atomic transactions or ACID on a single computer is a completely different problem from dealing with them across multiple, and still a different problem from dealing with them across datacenters. I'd argue that the size of your data is the first question you need to ask, only then you can figure out if you need/desire these features.
> once your data does not fit on one computer/harddrive you need to start figuring out a way for multiple to coordinate
Going from one computer to two generally more than doubles the complexity of the system. But when done right, going from two computers to two hundred can be easy. However, going from two hundred (in a single datacenter) to ~two thousand (across multiple datacenters) brings its own unique set of challenges.
Picking the database for an app is one of the biggest tech decisions any team can make, certainly more important than the underlying OS, and potentially even more important than choosing a predominant programming language. Even when starting a tiny startup with no initial customers, considering roughly how big you think your data can grow at the absolute maximum, and picking a database that can grow with you, can save you huge headaches down the road.
Case Study: I started a company on Python 2.7 (fine, but not enlightened decision), a fork of Django that supported NoSQL (bad decision), and an underlying NoSQL database that scaled with no configuration (great decision). In the first year, the database had around a gigabyte of data, so NoSQL was definitely overkill. But now, it's over 100 Terabytes, with auto replication making it much larger than that. Changing the Django middleware or even the Python language to 3.x will be difficult, but not impossible. Moving the database is far more terrifying, a non-starter.
I used Google's NoSQL datastore database. It's generally worked well for me, and has scaled from 1GB to 100TB incredibly well, it feels like magic. But I made the decision to go with it years ago. If I were starting from scratch today, I'd want to do a lot more research before endorsing it.
Another big factor you don't mention is concurrency. SQLite is single writer/multiple readers, unlike PostgreSQL which supports multiple concurrent writers. This makes it hard to run database migrations without blocking writes during the migration, which is an issue for most services requiring availability.
I have a small blog (using Ghost) that uses SQLite as the main database and it's incredibly easy to maintain. It's just a single file. For example, if you want to back up, you just copy the file or add it to git. If you want to do a diff you just do sqldiff on two files. It is as easy as working with JSON or CSV files, except with atomic writes and better querying.
> For example if you want to back up, you just copy the file or add it to git. If you want to do a diff you just do sqldiff on two files.
With Postgres, you run pg_dump on the database (which is a shell one-liner, only with slightly more arguments than cp) and that's even better for putting into git because it's a text file that git can diff natively and store efficiently with delta compression.
SQLite is even easier to start with -- I'm not trying to say Postgres is not the right choice, but it is certainly a more complex choice.
Postgres is a fantastic database, but most of the time getting started with and using SQLite will be easier. For example, you'd never have to set up firewall rules for a SQLite instance; with bad defaults (unlikely but possible) or some bad post-install configuration, you've got an exposed server.
I want to emphasize that my point about SQLite being viable for startups was with the assumption of a low-write high read workload -- most of the time queries won't even hit the DB but instead be served by some caching layer. It's also really easy to scale read only versions of your applications horizontally if all you've got to do is ship/swap out DB files (again at this point it might make more sense to look into a bigger DB and just scaling vertically).
Furthermore, Postgres has great replication, and a read replica is trivial to set up. SQLite is amazing (as an embedded database, absolutely) but Postgres can't be beat as a go-to for the server side.
To paraphrase the SQLite site: "SQLite isn't competing with [PostgreSQL]; it's competing with fopen(3)."
The primary purpose is to replace custom (and fragile) single-file databases with something that is far more resilient and provides uniformity in implementation. The fact that it is a surprisingly good SQL database (to the point where folks are actively arguing that much more fully-fledged databases are better in some aspects) is incidental.
SQLite can be handy but it's not a substitute for testing against a staging instance of another database platform. It ignores a lot of mistakes in column types and size limits, and sometimes even foreign keys. If anything, I would want tests to be more strict than prod.
Your point is correct, but I'm afraid you might not have understood what I meant -- I meant that people should be considering using SQLite in production.
Agreed that testing against copies of production or staging instances that are as close as possible to production is important, but what I'm suggesting is that you could ride SQLite all the way into production if you wanted to.
I'd love to hear more about the ignored column types/size limits/other issues that you've run into, though. Have you written about them anywhere, or is there an SO post? I haven't run into anything super egregious, though I've had some problems with FKs before.
Do a GROUP BY, and then try referencing a term which hasn't been either aggregated or mentioned in the GROUP BY clause. The result should be "abort! this query can't make sense!". The result (last I tried) is instead "sure, here's some arbitrary value from this group".
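A quick illustration of what you mean (a toy example of my own, not anything from production):

    import sqlite3

    con = sqlite3.connect(':memory:')
    con.execute('create table t(grp text, val int)')
    con.executemany('insert into t values (?, ?)',
                    [('a', 1), ('a', 2), ('b', 3)])
    # val is neither aggregated nor in the GROUP BY; most engines reject this,
    # SQLite just hands back a value from some row of each group
    print(con.execute('select grp, val from t group by grp').fetchall())
    # e.g. [('a', 2), ('b', 3)] -- exactly which 'a' row you get is unspecified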
> Many modern startups could get pretty far with SQLite + aggressive caching
Not only did we get "pretty far", we got to what is effectively a final destination for this generation of our product. We don't even need "aggressive caching" beyond what the filesystem provides to us by way of accessing the SQLite db file.
Today, we have a single SQLite database persisting stateful data for an application that is used concurrently by 100+ users. I have benchmarked 512KB state update writes in a multithreaded scenario at >500 per second. 64KB updates exceed 1000 per second. It took us a lot of time and understanding to get to this point. At the end of the day with SQLite, less is always more. The first attempts we did involved connection pools, queues and all kinds of other fancy over-engineering in an attempt to extract more performance. We finally ended up with just the following:
- Enable Write Ahead Logging
- Set Synchronous to Normal
- Use a single SQLiteConnection
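For reference, a rough Python equivalent of that checklist (the .NET setup above obviously looks different; the pragma values are just the ones named in the list):

    import sqlite3

    # single shared connection, guarded by application-level locking elsewhere
    con = sqlite3.connect('app.db', check_same_thread=False)
    con.execute('pragma journal_mode=WAL')     # enable write-ahead logging
    con.execute('pragma synchronous=NORMAL')   # fewer fsyncs; reasonable in WAL mode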
Justification for a single connection: We are using .NET with application-level locking and an RWLS where appropriate. This is in contrast to relying on the database to push these transactional scopes to disk. For us, losing an in-flight transaction tracked in memory is not catastrophic so we can tolerate this 1-100ms danger zone for the huge upside we get in throughput. This is a great example of something you can only do with SQLite. No hosted database solution offers this degree of exclusivity over your data store from an application perspective.
One other application note to keep in mind is that SQLite (by default) operates in serialized mode, so the above approach is consistent regardless of the fact that we are hitting a single connection from multiple threads at once. The usages of RWLS are required in our case simply because we need to do multiple dependent sequential operations (select+modify+update) or insertion of new items (insert+last insert row id from connection). It is hypothetically possible to avoid locking altogether (aside from the implicit SQLite serialization) if you can structure your database such that you perform all operations in a single command - E.g.: Use application-generated keys upon insertion rather than waiting for the DB to generate one.
Also, backups and restores are super trivial as mentioned elsewhere in this discussion. I have considered writing a backup agent directly into our software that is transactionally-aware. E.g.: During a backup request, I can take out a write lock on the RWLS for the SQLiteConnection while copying the database to a backup destination. Alternatively, since the DB lives on the same machine as the application, you can just take a single VM snapshot throughout the day, but this won't be as consistent a backup (tolerable for our use case though).
The biggest advantage we see with using SQLite as our principal data store is that it has allowed us to completely avoid operational annoyances regarding how we actually ship our stack to various environments. .NET Self-Contained Deployments + SQLite = one bin folder that "just works". We were able to build custom deployment tooling because of the simplicity of this model, rather than relying on off-the-shelf tech that doesn't quite fit 100% into our use case.
Recently I heard[1] that SQLite does not offer concurrent access safety. That is to say, the premise was that if you were to use it as a database backing something multi user, like a forum, SQLite would suffer due to the concurrent access of users against the DB.
While I've not had time to research it yet, I'd be curious about your (or anyone's) knowledge/experience in this context.
My plan, if SQLite didn't offer concurrent access safety, was simply to put concurrent primitives around SQLite to ensure safety. However I'm not doing anything until I can make heads or tails of that original comment[1].
I'm not expecting someone to hand me an answer. I was merely asking for experiences in this area. You think I'm going to rely merely on internet strangers to decide a course of action on production systems?
God forbid I ask for experiences from people who seem seasoned in this area on Hacker News of all places. The gall I must have, right?
SQLite is the most stable small DB there is, maybe even the most stable one at all.
Our CTO made a cluster-wide distributed configuration filesystem[0], started about 9 years ago. For that a local backing store was required, and a small DB seemed to fit the bill: it should mean worrying less about data safety and about when data is really written out, and it could be a general interface which could be swapped out rather easily if required one day.
He checked out lots of the then-available options; sorry, I only remember BerkeleyDB and SQLite but there were more.
Anyway, only SQLite never crashed or had corrupt data when doing weird things with it, and it was still very fast, just great.
Almost 10 years later we still use it, never a single issue with SQLite as the cause - really impressive IMO.
Aside from its merits as a database, SQLite is also a great example of good documentation[0], thorough testing[1], and focused development. It's something that all library maintainers should study and try to learn from.
So, I've heard SQLite described as "Postgres for people not paying attention", because it ignores data type errors.
A lot of people in the HN community like types. Me included. I really don't want my database to ignore type errors. I basically want Postgres but embedded like a library. I cannot think of any reason why that couldn't work. Perhaps just no one has built it yet.
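To make the complaint concrete, this is the sort of thing people mean by SQLite "ignoring type errors" (a toy demo of type affinity, nothing more):

    import sqlite3

    con = sqlite3.connect(':memory:')
    con.execute('create table t(n integer)')
    # column affinity is a suggestion, not a constraint
    con.execute("insert into t values ('definitely not a number')")
    print(con.execute('select n, typeof(n) from t').fetchall())
    # [('definitely not a number', 'text')]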
Firebird (https://www.firebirdsql.org) can essentially work that way. It comes with two versions of the client library. One of them runs the database engine in-process (embedded mode), similar to SQLite. Later you can switch over to a separate server process, like with PostgreSQL. Though both those two set the bar pretty high, Firebird is a solid RDBMS that’s been around for a long time, with a well-balanced feature set (including strict data typing).
It's amazing to me that many SQLite libraries in the Go ecosystem don't support these mixed/flexible data types in tables.
If you read back a set of values and one of the fields in a row doesn't match the data type given by the column, the libraries will error out.
One library that works correctly is github.com/gwenn/gosqlite, which checks the data type of each field of the result set (per row), so handles things gracefully.
It's really not a problem. I use statically typed languages (mostly Go and Rust), and everything just works as long as my database later retrieves and stores the right types. The only slight complexity is currency, which gets stored as an integer because there's no decimal type.
Unless you have really strict requirements (generally not the case), SQLite's storage classes (NULL, INTEGER, REAL, TEXT, BLOB) are sufficient. If it really bothers you, you can use extensions to provide whatever abstractions you like. If you're mostly doing CRUD, you really shouldn't have any issues.
I have recently been wondering whether it would be possible to marry the pg query engine with the SQLite storage layer and C API. The idea being that you could get all the rich types etc. from pg with the embeddability of SQLite.
I remember reading once upon a time about the sqlite file format being a good candidate for data transfer between systems/machines, as it's standardized, stable, and infinitely queryable. Has anyone here had any experience actually using it as such? What are your takeaways?
If you have only one thread/process, use it as is; if you need multiple readers and one writer (who do not block each other), use WAL mode; if you need simultaneous multiple readers and writers, use Postgres.
If you are using SQLite, be aware that the “copy the file to move it to a different system” comes with a caveat - a regular Unix “cp” may get the file in an inconsistent state if it was modified while copying, or in the middle of a transaction.
IIRC “rsync”, if used to copy, will verify that the file was not modified while copying and restart the copy if it was (a number of times, and then error out). But that’s still not enough - you may need to also copy the shm file (in WAL mode) and the journal file - and you need to do that atomically. Which is harder still.
The right way to copy an SQLite file from place to place is:
Any way you like, if it was properly committed and closed last time it was used, and is not currently open by any other process;
Or creating a backup (e.g. with the sqlite3 command), which will be a nonblocking operation if in WAL mode - and then you can ship the backup whichever way you want.
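For completeness, a minimal sketch of option two using the online backup API that Python's sqlite3 module exposes (Connection.backup, available since Python 3.7; file names are made up):

    import sqlite3

    src = sqlite3.connect('live.db')
    dst = sqlite3.connect('backup.db')
    src.backup(dst)   # consistent online copy; other readers/writers can keep going
    src.close()
    dst.close()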
We have been using SQLite as a distributed object container for the last few years as part of an industry spec: <http://www1.semi.org/en/ritdb-interplanetary-database-manufa.... It has been working quite well on Windows, OS X and Linux, with Java, R, C, JS, and Python clients. We are using MQTT as the transport.
We use a common SQLite schema for exporting/importing very complex application traces between environments (several tables and lots of relationships). If we didn't have SQLite managing this for us, it would probably be some horrific JSON blob on the wire with serialization+LINQ on both ends blowing up RAM and crashing things.
With SQLite, it's as simple as dumping the .db file from one machine to another and running your analysis tools over it - aka executing simple SQL queries from various reporting views. We also see really good compression on this type of structured data (we just compress the entire .db file on disk).
Our analysis tools are typically just DB Browser for SQLite - A tool I would say is 10x better & faster than SSMS: https://sqlitebrowser.org
I use it as exactly this. I use a bespoke schema for a case where I need to move millions of small files around and be able to access them efficiently. It actually works out to be faster than filesystem access in my case, and so much easier than trying to manage tons of tiny files.
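For anyone wondering what such a bespoke schema might look like, here's a hypothetical path-to-blob table (not the poster's actual schema, just an illustration):

    import sqlite3

    con = sqlite3.connect('files.db')
    con.execute('''
        create table if not exists files(
            path text primary key,
            data blob not null
        )
    ''')

    def put_file(path, payload):
        # payload is bytes; sqlite3 stores it as a BLOB
        with con:
            con.execute('insert or replace into files values (?, ?)', (path, payload))

    def get_file(path):
        row = con.execute('select data from files where path = ?', (path,)).fetchone()
        return row[0] if row else None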
Recent versions of SQLite's CLI also have a "sqlar" mode [1] that lets you treat SQLite database files as an archive format. Unlike some other archive formats, adding and removing files in an existing sqlar file is a first-class operation, since it's just a database file. It's basically what I'm doing for my "ton of tiny files" case, just with a documented schema.
> - Lost decimal places on a simple database round-trip.
This seems like a misunderstanding of the types that SQLite was able to store natively and which one was right for your use case. While SQLite certainly has fewer options out of the box than a larger database has for types, this could happen in another database as well with a wrongly chosen type (e.g. integer vs bigint in Postgres).
> - Issues with accessing the DB from multiple threads.
How SQLite acts under multiple thread access is well documented[0]. You can even get closer to bigger systems by using WAL mode[1]. The fact that SQLite isn't the best for concurrent writes is also discussed in the page on when to use SQLite[2]:
> Many concurrent writers? → choose client/server
> If many threads and/or processes need to write the database at the same instant (and they cannot queue up and take turns) then it is best to select a database engine that supports that capability, which always means a client/server database engine.
> SQLite only supports one writer at a time per database file. But in most cases, a write transaction only takes milliseconds and so multiple writers can simply take turns. SQLite will handle more write concurrency than many people suspect. Nevertheless, client/server database systems, because they have a long-running server process at hand to coordinate access, can usually handle far more write concurrency than SQLite ever will.
A bit of a reach but I'm also willing to bet that the architectural pattern that you were using that required multiple threads to write to the DB at the same time might benefit from passing that responsibility to a single thread which could possibly employ some batching, which is normally one of the low hanging fruits of DB write performance tuning.
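A hedged sketch of that single-writer-with-batching pattern (the events table and column are made up for illustration):

    import queue
    import sqlite3
    import threading

    writes = queue.Queue()

    def writer(db_path):
        con = sqlite3.connect(db_path)
        con.execute('create table if not exists events(payload text)')
        while True:
            batch = [writes.get()]                 # block until there is work
            while not writes.empty():
                batch.append(writes.get_nowait())  # drain anything else queued up
            with con:                              # one transaction per batch
                con.executemany('insert into events(payload) values (?)', batch)

    threading.Thread(target=writer, args=('app.db',), daemon=True).start()
    # other threads enqueue rows instead of touching the connection:
    writes.put(('hello',))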
I once conceived of a plan to store a lot of data per user and thought of the same idea. How do you store the DBs? Keeping them on, say, S3 means there's a multi-second (or longer) load time when a user logs in and you need to load the DB onto the hard drive, right? (I'm thinking a GB or more of data per user.) I considered an EC2 instance with a multi-terabyte EBS attached, which could then effectively store a thousand users' data. Are there any other possibilities?
Our DBs are tiny for most users. We run S3 (Minio/Rook/Ceph on k8s) locally, so the network latency is just cluster latency.
I figure you could just throw hardware at it like you mentioned. Move them to NVMe-backed S3 if needed.
And our use case is only ever load, do a read or write, and then save. So the DBs aren't open for very long.
And with S3 compression you could save on download time but pay a decompression cost.
This approach has its downsides, don't get me wrong, but it scales nicely. Just forget about running aggregates across the databases, at least not for a real-time result.
This is called sharding[0] in the database world and can be done in many ways. It is particularly easy to achieve with SQLite because of the tiny overhead of storing, shifting, duplicating and moving around SQLite databases relative to fully-fledged RDBMSs. [0] https://en.wikipedia.org/wiki/Sharding
Interesting that you had issues with floating point precision, what type were you using inside your application? SQLite's floating point type (REAL) is 64 bit.
Perhaps you don't need it anymore, but for anyone else reading: I've worked with some really large CSV files, ones I don't dare open in most editors, using xsv.
> I used to do a fair amount of ad hoc analysis with large (more than Excel can handle is large enough) csv files. Sqlite3 was my saviour.
This was probably the right decision 5 years ago, but I'm not sure SQLite is the proper solution for "ad hoc" analysis now. Excel now has plugins to handle very large data sources. And of course, the company leading this charge is Tableau, with many others doing something similar.
SQLite can still do all the manipulation that these others can, but the others can also be used by non-developers and come with pretty graphs and pictures. For large CSV analysis on an ad hoc basis, SQLite doesn't stack up well against the competition.
> As someone who has to use Excel a lot, I'm very interested in what your typical workflow looks like.
It's pretty standard Excel stuff: summing columns, averaging, functions combining multiple columns, pivot tables. The significant difference is that I routinely deal with 20k+ rows, and occasionally 100k+ rows. For simple stuff, standard Excel starts breaking down with 10k+ rows, and it even struggles with 1k rows with more complicated manipulation.
Every time I have to set up a Java runtime on Windows and click through Oracle's boast of it running on 3 (US) billion devices (which pops up on the installer) I think of how SQLite probably is at least a thousand billion in front :)
Great to see FILTER on aggregates, solves a small pain I have with my home automation stats :)
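For anyone who hasn't seen it, a minimal sketch of the FILTER clause on plain aggregates (needs an SQLite new enough to support it, 3.30+; the table and values are made up):

    import sqlite3

    con = sqlite3.connect(':memory:')
    con.execute('create table readings(room text, temp real)')
    con.executemany('insert into readings values (?, ?)',
                    [('kitchen', 21.5), ('kitchen', 24.0), ('bedroom', 19.0)])
    # replaces the old sum(case when ... then 1 else 0 end) dance
    print(con.execute('''
        select count(*) filter (where temp > 20)  as warm,
               count(*) filter (where temp <= 20) as cool
        from readings
    ''').fetchall())
    # [(2, 1)]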
Something else I appreciate about SQLite: it proves that under the ultimate free software distribution scheme -- public domain -- corporate users contribute improvements back to the project.
I wish SQLite would move to GitHub with issue tracking. Digging through their issues to find out whether my particular issue has been reported is not as pleasant as with GitHub issues.
Specifically I’m now wondering if a case-insensitive replace() string function is already a feature request or perhaps even implemented in a recent version.
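In the meantime, one workaround is to register your own scalar function from the host language; ireplace here is just a name I made up, not a built-in:

    import re
    import sqlite3

    con = sqlite3.connect(':memory:')
    con.create_function(
        'ireplace', 3,
        lambda s, old, new: re.sub(re.escape(old), new, s, flags=re.IGNORECASE))
    print(con.execute("select ireplace('Hello WORLD', 'world', 'there')").fetchone())
    # ('Hello there',)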
How does SQLite compare to, say, BerkeleyDB or LevelDB, which I think compete in the same arena?
I've used SQLite and like it a lot, but I've recently started using these two for some pet projects and found them easier to work with, and in theory they should scale more easily if I understand their strengths.
It depends on your use case. LevelDB is way too sloppy about memory allocation, threads and I/O, which may not be acceptable for an embeddable database. I got to the point once where every daemon on every node using LevelDB had to be restarted daily from cron.
> SQLite4 was an experimental rewrite of SQLite that was active from 2012 through 2014. All development work on SQLite4 has ended. Lessons learned from SQLite4 have been folded into the main SQLite3 product. SQLite4 was never released. There are no plans to revive it. You should be using SQLite3.
[0]: https://sqlite.org/mostdeployed.html
[1]: https://www.sqlite.org/fts5.html
[2]: https://sqlite.org/fts3.html
[3]: https://sqlite.org/lang_with.html
[4]: https://www.sqlite.org/json1.html
[5]: https://www.sqlite.org/loadext.html
[6]: https://www.sqlite.org/whentouse.html