
I think a good, under-appreciated use case for SQLite is as a build artifact of ETL processes/build processes/data pipelines. Understandably, a lot of people's default is to use JSON for the output and intermediate results, but if you use SQLite, you get all the benefits of SQL (indexes, joins, grouping, ordering, querying logic, and random access) and many of the benefits of JSON files (SQLite DBs are just files that are easy to copy, store, version, etc., and don't require a centralized service).

I'm not saying ALWAYS use SQLite for these cases, but in the right scenario it can simplify things significantly.

Another similar use case would be AI/ML models that require a bunch of data to operate (e.g. large random forests). If you store that data in Postgres, Mongo, or Redis, it becomes hard to ship your model alongside updated data sets. If you store the data in memory (e.g. if you just serialize your model after training it), it can be too large to fit in memory. SQLite (or another embedded database, like BerkeleyDB) can give the best of both worlds: fast random access, low memory usage, and easy shipping.
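
To make the idea concrete, here is a minimal Python sketch of the pattern (table and column names are invented for illustration): a pipeline step writes its results into a SQLite file, and downstream consumers get indexed random access without loading everything into memory.

    import sqlite3

    def write_artifact(rows, path="artifact.sqlite"):
        # rows: iterable of (feature_id, weight) pairs produced by a pipeline step.
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE IF NOT EXISTS weights (feature_id TEXT PRIMARY KEY, weight REAL)")
        conn.executemany("INSERT OR REPLACE INTO weights VALUES (?, ?)", rows)
        conn.commit()
        conn.close()

    def lookup(path, feature_id):
        # Random access by key, without reading the whole artifact.
        conn = sqlite3.connect(path)
        row = conn.execute("SELECT weight FROM weights WHERE feature_id = ?", (feature_id,)).fetchone()
        conn.close()
        return row[0] if row else None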




I have been using SQLite as a format to move data between steps in a complicated batch processing pipeline.

With the right pragmas it is both faster and more compact than JSON. It is also much more "human readable" than gigabytes of JSON.

I only wish there was a way to open an http-fetched SQLite database from memory so I don't have to write it to disk first.


> I only wish there was a way to open an http-fetched SQLite database from memory so I don't have to write it to disk first.

The sqlite3_deserialize() interface was created for this very purpose. https://www.sqlite.org/c3ref/deserialize.html
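
For example, Python's sqlite3 module exposes this as Connection.deserialize() as of 3.11. A rough sketch of fetching a database over HTTP and opening it without touching disk (the URL is a placeholder):

    import sqlite3
    import urllib.request

    # Fetch the database file into memory (placeholder URL).
    data = urllib.request.urlopen("https://example.com/data.sqlite").read()

    # Load the bytes into an in-memory connection via sqlite3_deserialize().
    conn = sqlite3.connect(":memory:")
    conn.deserialize(data)  # requires Python 3.11+

    print(conn.execute("SELECT count(*) FROM sqlite_master").fetchone())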


If the language's sqlite bindings don't offer a way to load a database from a string, and you're on a modern Linux kernel (3.17+), you can make use of the memfd_create syscall: it creates an anonymous memory-backed file descriptor equivalent to a tmpfs file, but no tmpfs filesystem needs to be mounted and there's no need to think about file paths.
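
A rough Python sketch of that route, assuming Linux and Python 3.8+ (the URL is a placeholder):

    import os
    import sqlite3
    import urllib.request

    # Fetch the database into memory (placeholder URL).
    data = urllib.request.urlopen("https://example.com/data.sqlite").read()

    # Anonymous memory-backed file; behaves like a tmpfs file with no mount.
    fd = os.memfd_create("fetched.sqlite")
    os.write(fd, data)

    # SQLite can open it through the /proc/self/fd/ path; read-only is enough
    # here and avoids journal files, which can't be created under /proc.
    conn = sqlite3.connect(f"file:/proc/self/fd/{fd}?mode=ro", uri=True)
    print(conn.execute("SELECT count(*) FROM sqlite_master").fetchone())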


You can use the memvfs module to load in-memory databases if you're using the C API. I'm not sure how many higher-level APIs support it though.

[1] https://stackoverflow.com/a/53453338/3063 [2] https://www.sqlite.org/loadext.html#example_extensions [3] https://www.sqlite.org/src/file/ext/misc/memvfs.c


A very interesting approach is sqltorrent (https://github.com/bittorrent/sqltorrent): the sqlite file is shared as a torrent, and each query touches only a specific part of the file, which is downloaded on demand.

Also check https://github.com/lmatteis/torrent-net


Incredibly odd, but so awesome


  $ mount -t tmpfs none /some/path
  $ cp db.sqlite /some/path/db.sqlite
  $ sqlite3 /some/path/db.sqlite

We've been abusing tmpfs for more than 10 years to get around the IO layer's failings. It's probably still a valid pattern.


This is amazing. I think you may have just solved and headed off a huge number of odd problems for me.

Could you talk more about what pragmas you’ve been using and why?


Not the OP, but I find `PRAGMA synchronous = OFF` makes the creation of DBs vastly faster ...
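
For context, a rough sketch of how that pragma is often combined with other durability-for-speed trade-offs when building a throwaway artifact (only sensible when the file can simply be rebuilt):

    import sqlite3

    conn = sqlite3.connect("artifact.sqlite")
    # Skip fsync and keep the rollback journal in memory: much faster bulk
    # loading, at the cost of corruption if the process dies mid-build.
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("PRAGMA journal_mode = MEMORY")
    conn.execute("PRAGMA temp_store = MEMORY")

    conn.execute("CREATE TABLE IF NOT EXISTS t (x)")
    with conn:  # one big transaction around the bulk inserts also helps a lot
        conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100000)])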


> I only wish there was a way to open an http-fetched SQLite database from memory so I don't have to write it to disk first.

Ramfs?


tmpfs is the better-behaved option should you run out of resources, see:

https://www.jamescoyle.net/knowledge/951-the-difference-betw...

I'm still remembering old-school ramdisks under Linux which were finite in both number and size, both to quite small extents. I think there were 8 (or 12 or 16?) total ramdisks available, of only 2-4 MB each, configurable with LILO boot options.

That's now ... mostly taking up valuable storage in my own brain for no useful effect.


It looks like a good intro, thanks. I wasn't aware of these technologies, but I knew it was possible to build an FS in RAM. So I just put these two keywords together.


FWIW, I learned a few things researching my answer.

(A prime validation for answering questions, BTW.)

My first read was that the old-school ramfs / ramdisk limitations still held. I can't actually even find documentation on them, though I'm pretty sure I'm not dreaming this.

Circa the 2.0 kernel, IIRC, possibly earlier.

OK, some traces remain, see:

https://www.tldp.org/HOWTO/Bootdisk-HOWTO/x1143.html

Note that this is OBSOLETE information.


What pragmas do you use? It sounds amazing!


Using SQLite in my ETL processes is something I have done for over a decade. It's just so convenient and, at the end, I have this file that can be examined and queried to see where something might have gone wrong. All of my "temporary" tables are right there for me to look at. It is wonderful!


Yes! Along these lines I heartily recommend `lnav` ^1, a fantastic, lightweight, scriptable CLI mini-ETL tool with an embedded sqlite engine, ideally suited for working with moderately-sized data sets (i.e., millions of rows, not billions) ... so useful!

1. https://lnav.org


I have used it to inspect, say, the history of a user's requests on a load-balanced server. I like to permanently store the results of the logfile excerpt to a DB table for posterity and future reporting.

Figuring out how to enter "sql" mode in lnav, generate a logfile table, and then persist it from an in-memory sqlite db to a saved-to-disk sqlite db .... was frustratingly annoying.

It boils down to:

    :create-logline-table custom_log
    ;ATTACH DATABASE 'test02.db' AS bkup;
    ;create table bkup.custom_log as select * from custom_log;
    ;detach database bkup;
If I recall, you cannot call sqlite dot-commands like ".backup" or similar in lnav's SQL mode. So lnav's interjection into the sqlite command processing is annoying (I'm actually very familiar with sqlite).


Would you mind elaborating on your ETL process a little more? I'm a junior DE and curious about how I would implement this.


It's pretty straightforward, really.

I construct the .sqlite database from scratch each time in Python, building out table after table as I like it.

Some configuration data is loaded in from files first. This could be some default values or even test records for later injection.

The input data is loaded into the appropriate tables and then indexed as appropriate (or if appropriate). It is as "raw" as I can get it.

Each successive transformation occurs on a new table. This is so I can always go back one step for any post-mortem if I need to. Also, I can reference something that might be DELETEd in a later table.

Often (and this is task-dependent), I will have to pull in data from other server-based databases, typically the target. They get their own tables. Then I can mark certain records as not being present in the target database, so they must be INSERTed. If a record is not present in my input but is there in the target, that would suggest a DELETE. Finally, I can compare records where some ID is present in both my input and the target; those might be good for an UPDATE. All of this is so I can make only the changes that need to be made. Speed is not important to me here, only understanding what changes needed to be made and having a record of what they were and why.
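
A rough sketch of that comparison step, with invented table and column names (my real schemas are more involved):

    import sqlite3

    conn = sqlite3.connect("etl_run.sqlite")

    # 'input_rows' holds the transformed input; 'target_rows' is a copy pulled
    # from the target database.
    conn.executescript("""
        -- In the input but not the target: candidates for INSERT.
        CREATE TABLE to_insert AS
            SELECT i.* FROM input_rows i
            LEFT JOIN target_rows t ON t.id = i.id
            WHERE t.id IS NULL;

        -- In the target but not the input: candidates for DELETE.
        CREATE TABLE to_delete AS
            SELECT t.* FROM target_rows t
            LEFT JOIN input_rows i ON i.id = t.id
            WHERE i.id IS NULL;

        -- In both, but with differing values: candidates for UPDATE.
        CREATE TABLE to_update AS
            SELECT i.* FROM input_rows i
            JOIN target_rows t ON t.id = i.id
            WHERE i.value IS NOT t.value;
    """)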

I am happy to say that an ETL process I wrote using this general method back around 2009 is probably still running. I haven't had to touch it in years. Occasionally I will receive questions as to "why did this happen?" and I can just start running queries on the resultant .sqlite database file, kept with the logs, for answers.

Similarly, I can use these sorts of techniques when I am analyzing other datasets. The value here is that I can just refresh one table when the relevant data comes in, rather than having to run the ingest process for everything all over again. This can save me a lot of time.


Awesome - elegantly simple using very common technologies.


I am not a very talented programmer so I stick very close to what is common, standard, and easy to understand. It usually means I am on the downslope of the hype cycle and it limits some opportunities but I have become okay with that.

I have turned some CS students, who were about to shoot flies with various cannons, on to SQLite. I kept a couple of the decent books about it nearby and would shove one into their hands at that point. Usually a week later they would be raving about it.


Do you still have the titles of those books at hand? I'd love to take a look at them.


They are The Definitive Guide to SQLite by Mike Owens and Using SQLite by Jay A. Kreibich. I am quite sure they are more book than I needed; I only plumbed a fraction of SQLite's immense capabilities.


Do you generate the file from scratch every time or do you modify the previous one as new data arrives?


Depends on what you want... if you have a separate db project, you can have the output of that project be a clean database for testing other things, or a set of migration scripts for existing deployments.

I've been working on doing similar with containerized database servers for testing, while still having versioned scripts for prod (multiple separate deployments).


It is a bit of a hybrid.

In the early stages of development of whatever the ETL process is, I keep the database and just empty it out each time. As I got more of a sense of what I needed, I started DROPing my TABLEs more often and remaking them. Eventually, once I was well along and had most everything fleshed out, I would make the whole database from scratch.


Ok. So each export is a full dump, not a delta on a previous one.

Do you anticipate hitting a wall at some point where the total time becomes a problem?


Well, it depends on the process. Some were full dumps, some were deltas pushed up to the final database, and sometimes both (this product in particular had a load-from-file capability that you were supposed to use, but some edge cases were not well-addressed).

No, the time never grew significantly.

For one of the analysis projects, just one step of the analysis was quite time consuming but it would have been that way no matter what. SQLite allowed me to let it grind away overnight (or even over a weekend) on a workstation without tormenting production servers.


We do something like this; one of the outputs of the data pipeline is an sqlite file that's deployed nightly along with code to App Engine. The sqlite stuff is all read only, read/write data for the app is stored in firestore instead.

We initially used json but ran into memory issues; sqlite is more memory efficient, and being able to use SQL instead of the wild SQL-esque is both faster and more reliable.


Yes, I have been doing the same thing, only with LMDB.

I do not think LMDB can load from an in-memory-only object (as it has to have a file to memory-map), however.

But for the same design reasons, I wanted something that:

a) I can move across host architectures

b) something that can act as a key-val cache as soon as the processes using it are restarted (so no cache-hydration delay)

c) something that I can diff/archive/restore/modify in place

We tested sqlite for the above purpose at the time, and on writing speed and (b), lmdb was significantly faster.

So we lost the flexibility of SQLite, but I felt it was a reasonable tradeoff, given our needs.

I also know that one of Intel's Python toolkits for image recognition/AI optionally uses LMDB to store images, so that processing routines do not have to incur the cost of directory lookups when touching millions of small images (forgot the name of the toolkit, though)…

Overall, this is a very valid practice/pattern in data processing pipelines, kudos to you for mentioning it.


"wild SQL-esque" should have been "wild SQL-esque thing I wrote to query the JSON"


I've wondered about this too, but have not gotten around to trying it yet.

We get a gnarly csv log file back from our sensors in the field, which is really a "flattened" relational data model. What I mean by that is a file with "sets" of records of various lengths, all stacked on top of each other. So, if you open it in Excel (which many users do), the first set of 50 rows may be 10 columns wide, the next 100 rows will be 20 columns wide, the next 45 wide, etc. And the columns for each of these record sets have different names and data types.

Converting to JSON is obvious, but I've thought about just creating a SQLite file with tables for each of the sets of records. Then, as others have said, you can use any number of tools to easily query/examine the file. It's also easy to import into a pandas data frame.
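
If it helps, here is a very rough Python sketch of that idea. It assumes each record set starts with its own header row and that sets can be told apart by a change in column count, which may not match the real log format:

    import csv
    import sqlite3

    def load_stacked_csv(csv_path, db_path):
        conn = sqlite3.connect(db_path)
        with open(csv_path, newline="") as f:
            header, set_no = None, 0
            for row in csv.reader(f):
                if header is None or len(row) != len(header):
                    # New record set: treat this row as its header.
                    header, set_no = row, set_no + 1
                    cols = ", ".join(f'"{c}"' for c in header)
                    conn.execute(f'CREATE TABLE recordset_{set_no} ({cols})')
                else:
                    marks = ", ".join(["?"] * len(row))
                    conn.execute(f'INSERT INTO recordset_{set_no} VALUES ({marks})', row)
        conn.commit()
        return conn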

One concern is file size. Any comments on this? I can try it, but I wonder if anyone knows off the top of their head whether a large JSON file converted to a SQLite file would be a lot larger or smaller?

edit: clarity


Yes, it is great for that.

You only have to read the CSV file once, and after that you have a nice set of tables you can query any which way you want.

I use SQLite as an intermediate step between text files and static HTML, for example.


I was under the impression SQLite files were not supposed to be moved across architectures.


Thankfully that's not the case:

> The SQLite file format is cross-platform. A database file written on one machine can be copied to and used on a different machine with a different architecture. Big-endian or little-endian, 32-bit or 64-bit does not matter. All machines use the same file format. Furthermore, the developers have pledged to keep the file format stable and backwards compatible, so newer versions of SQLite can read and write older database files.

https://www.sqlite.org/different.html


How does this work with container/ephemeral services such as typical K8s deployments? Can I trust the file system mounting via resources like StatefulSets or FS mounts? For that matter, Heroku, App Engine, Cloud Functions, whatever?

Our current setup is having all our services in kubernetes but our databases in stateful VMs. I do occasionally stuff job-reports and similar data into postgres rows since it's already there, but I've been unhappy with our ETL setup and would be interested in hearing techniques to improve it.


ETL workers themselves are typically ephemeral, plumbing batches between remote storage systems like Postgres, S3, and Hive. You might use local disk as scratch space during the batch, but not as a sink.


From experience, and supported by the SQLite docs, I can tell you that trying to run sqlite on files on an NFS-mounted filesystem will not work. See section 2.1 of this document [1] and the related HN discussion [2].

[1] https://www.sqlite.org/howtocorrupt.html

[2] https://news.ycombinator.com/item?id=22098832


Use a volume container mounted against a persistent storage engine on the node and do pod mounting from those containers. Stateful VMs are often a better choice for production imo.

I'm in favor of leveraging ISP dbaas and persistence offerings over trying to home grow something. It just depends on where you are coming from and/or what you are trying to do... K8s alone avoids so much lock in, and as long as whatever storage option (container mount) or dbaas you use is portable, I don't think it's so bad in either case.


I don't know what "ETL" means here, but SQLite does include a JSON extension to read/write JSON data too, so you can use SQL and JSON together if necessary.


It's an acronym for munging some data

ETL = extract, transform, load


Yes, if that is what you are trying to do, I think SQLite is good. The SQLite command shell also has a .import command to read data from a file, and you can also import into a view and use triggers to process the data (this is something I have done). There are also functions and virtual tables for JSON, and you can load extensions (written in C) to add additional functions, virtual tables, collations, etc. So for many cases, SQLite is useful.
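
A minimal sketch of the view-plus-trigger trick (names are made up), shown here through Python but equally usable as the target of the shell's .import:

    import sqlite3

    conn = sqlite3.connect("staging.sqlite")
    conn.executescript("""
        CREATE TABLE readings (sensor TEXT, taken_at TEXT, value REAL);

        -- The view is the import target; the trigger does the processing.
        CREATE VIEW readings_import AS
            SELECT sensor, taken_at, value FROM readings;

        CREATE TRIGGER readings_import_ins INSTEAD OF INSERT ON readings_import
        BEGIN
            INSERT INTO readings
            VALUES (trim(NEW.sensor), NEW.taken_at, CAST(NEW.value AS REAL));
        END;
    """)

    # In the SQLite shell:  .mode csv   then   .import readings.csv readings_import
    conn.execute("INSERT INTO readings_import VALUES ('  s1 ', '2020-01-01', '3.5')")
    print(conn.execute("SELECT * FROM readings").fetchone())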


This is inspiring; I cannot believe I had not considered this before!


Beware opening SQLite files you didn't create: https://research.checkpoint.com/select-code_execution-from-u...


I’d love a SQLite to macOS Excel (or any macOS spreadsheet application) workflow so less technical users can do analysis. Has anybody pulled this off?


You mean like the .excel command?

"... causes them to accumulate output as Comma-Separated-Values (CSV) in a temporary file, then invoke the default system utility for viewing CSV files (usually a spreadsheet program) on the result. This is a quick way of sending the result of a query to a spreadsheet for easy viewing"


Or load it in Metabase (as a macOS app).


You can do that with Power Query.


Terrific! Data pipelines I've built have used JSON for their intermediate steps, which I'm growing weary of.



