I'm not saying ALWAYS use SQLite for these cases, but in the right scenario it can simplify things significantly.
Another similar use case is AI/ML models that require a lot of data to operate (e.g. large random forests). If you store that data in Postgres, Mongo, or Redis, it becomes hard to ship your model alongside updated data sets. If you keep the data in memory (e.g. if you just serialize your model after training it), it can be too large to fit. SQLite (or another embedded database, like BerkeleyDB) can give you the best of both worlds: fast random access, low memory usage, and easy shipping.
With the right pragmas it is both faster and more compact than JSON.
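The comment doesn't say which pragmas it means, so these are only commonly cited candidates for speed and compactness (page size, WAL journaling, relaxed syncing), not the author's actual settings:

```python
import os
import sqlite3
import tempfile

# Guessed-at pragmas, not the commenter's real configuration.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("PRAGMA page_size = 8192")      # must be set before the first write
con.execute("PRAGMA journal_mode = WAL")    # better concurrency, fewer journal files
con.execute("PRAGMA synchronous = NORMAL")  # fewer fsyncs; still durable enough in WAL

con.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
con.execute("INSERT INTO kv VALUES ('greeting', 'hello')")
con.commit()
print(con.execute("PRAGMA journal_mode").fetchone()[0])  # → wal
```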
It is also much more "human readable" than gigabytes of JSON.
I only wish there was a way to open an http-fetched SQLite database from memory so I don't have to write it to disk first.
The sqlite3_deserialize() interface was created for this very purpose. https://www.sqlite.org/c3ref/deserialize.html
Also check https://github.com/lmatteis/torrent-net
$ mount -t tmpfs none /some/path
$ cp db.sqlite /some/path/db.sqlite
$ sqlite3 /some/path/db.sqlite
Could you talk more about what Pragmas you’ve been using and why?
I'm still remembering old-school ramdisks under Linux which were finite in both number and size, both to quite small extents. I think there were 8 (or 12 or 16?) total ramdisks available, of only 2-4 MB each, configurable with LILO boot options.
That's now ... mostly taking up valuable storage in my own brain for no useful effect.
(A prime validation for answering questions, BTW.)
My first read was that the old-school ramfs / ramdisk limitations still held. I can't actually even find documentation on them, though I'm pretty sure I'm not dreaming this.
Circa the 2.0 kernel, IIRC, possibly earlier.
OK, some traces remain, see:
Note that this is OBSOLETE information.
Figuring out how to enter "sql" mode in lnav, generate a logfile table, and then persist it from an in-memory sqlite db to a saved-to-disk sqlite db .... was frustratingly annoying.
It boils down to:
;ATTACH DATABASE 'test02.db' AS bkup;
;create table bkup.custom_log as select * from custom_log;
;detach database bkup;
I construct the .sqlite database from scratch each time in Python, building out table after table as I like it.
Some configuration data is loaded in from files first. This could be some default values or even test records for later injection.
The input data is loaded into the appropriate tables and then indexed as appropriate (or if appropriate). It is as "raw" as I can get it.
Each successive transformation occurs on a new table. This is so I can always go back one step for any post-mortem if I need to. Also, I can reference something that might be DELETEd in a later table.
Often (and this is task-dependent), I will have to pull in data from other server-based databases, typically the target. They get their own tables. Then I can mark certain records as not being present in the target database, so they must be INSERTed. If a record is not present in my input and is there in the target, that would suggest a DELETE. Finally, I can compare records where some ID is present in my input and my .sqlite, they might be good for an UPDATE. All of this is so I can make only the changes that need to be made. Speed is not important to me here, only understanding what changes needed to be made and having a record of what they were and why.
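The diff step described above can be sketched in a few queries; the table and column names here are invented stand-ins for the real input and target schemas:

```python
import sqlite3

# Stage the input data and a copy of the target's rows side by side.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE input(id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE target(id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO input  VALUES (1,'a'), (2,'b'), (4,'d');
    INSERT INTO target VALUES (1,'a'), (2,'B'), (3,'c');
""")

# In input but not target -> candidates for INSERT.
inserts = con.execute(
    "SELECT id FROM input WHERE id NOT IN (SELECT id FROM target)").fetchall()
# In target but not input -> candidates for DELETE.
deletes = con.execute(
    "SELECT id FROM target WHERE id NOT IN (SELECT id FROM input)").fetchall()
# Present in both but differing -> candidates for UPDATE.
updates = con.execute("""
    SELECT i.id FROM input i JOIN target t USING (id)
    WHERE i.val <> t.val""").fetchall()

print(inserts, deletes, updates)  # → [(4,)] [(3,)] [(2,)]
```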
I am happy to say that an ETL process I wrote using this general method back around 2009 is probably still running. I haven't had to touch it in years. Occasionally I will receive questions as to "why did this happen?" and I can just start running queries on the resultant .sqlite database file, kept with the logs, for answers.
Similarly, I can use these sorts of techniques when I am analyzing other datasets. The value here is that I can just refresh one table when the relevant data comes in, rather than having to run the ingest process for everything all over again. This can save me a lot of time.
I have gotten some CS students who were about to shoot flies with various cannons turned on to SQLite. I kept a couple of the decent books about it nearby and would shove it into their hands at that point. Usually a week later they would be raving about it.
I've been working on doing something similar with containerized database servers for testing, while still having versioned scripts for prod (multiple separate deployments).
In the early stages of development of whatever the ETL process is, I keep the database and just empty it out each time. As I got more of a sense of what I needed, I started DROPing my TABLEs more often and remaking them. Eventually I would make the whole database from scratch once I was well along and had most everything fleshed out.
Do you anticipate hitting a wall at some point where the total time becomes a problem?
No, the time never grew significantly.
For one of the analysis projects, just one step of the analysis was quite time consuming but it would have been that way no matter what. SQLite allowed me to let it grind away overnight (or even over a weekend) on a workstation without tormenting production servers.
We initially used JSON but ran into memory issues; SQLite is more memory efficient, and being able to use real SQL instead of some wild SQL-esque alternative is both faster and more reliable.
I do not think LMDB can load from an in-memory-only object, however (it has to have a file to memory-map).
But same design reasons, I wanted something that
a) I can move across host architectures
b) something that can act as key-val cache, as soon as the processes using it are restarted (so no cache hydrating delay)
c) something that I can diff/archive/restore/modify in place
We tested SQLite for the above purpose at the time, and on write speed and (b), LMDB was significantly faster.
So we lost the flexibility of SQLite, but I felt it was a reasonable tradeoff, given our needs.
I also know that one of Intel's Python toolkits for image recognition/AI (optionally) uses LMDB to store images, so that processing routines do not have to incur the cost of directory lookups when touching millions of small images.
(forgot the name of the toolkit though)…
Overall, this is a very valid practice/pattern in data processing pipelines, kudos to you for mentioning it.
We get a gnarly csv log file back from our sensors in the field, which is really a "flattened" relational data model. What I mean by that is a file with "sets" of records of various lengths, all stacked on top of each other. So, if you open it in Excel, (which many users do), the first set of 50 rows may be 10 columns wide, the next 100 rows will be 20 columns wide, the next 45 wide, etc. And, the columns for each of these record sets have different names and data types.
Converting to JSON is obvious, but I've thought about just creating a SQLite file with tables for each of the sets of records. Then, as others have said, you can use one of any number of tools to easily query/examine the file. You can also easily import it into a pandas data frame.
One concern is file size. Any comments on this? I can try it, but wonder if anyone knows off the top of their head whether a large JSON file converted to a SQLite file would be a lot larger or smaller?
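A rough sketch of splitting such a "flattened" CSV into one SQLite table per record set. The assumptions here are mine, not the sensor format's: each set starts with its own header row, and a blank line separates sets; all names are invented.

```python
import csv
import io
import sqlite3

# Toy stand-in for the sensor log: two stacked record sets of
# different widths, separated by a blank line.
raw = "a,b\n1,2\n3,4\n\nx,y,z\n5,6,7\n"

con = sqlite3.connect(":memory:")  # use a filename to persist the result
for n, block in enumerate(raw.strip().split("\n\n")):
    rows = list(csv.reader(io.StringIO(block)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{c}"' for c in header)
    con.execute(f"CREATE TABLE set_{n} ({cols})")
    marks = ", ".join("?" * len(header))
    con.executemany(f"INSERT INTO set_{n} VALUES ({marks})", data)
con.commit()
```

From there, `pandas.read_sql_query("SELECT * FROM set_0", con)` pulls a record set straight into a data frame. On size: SQLite doesn't repeat key names per record the way JSON does, so it's often more compact, but the only reliable answer is to try it on a real file.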
You only have to read the CSV file once, and after that you have a nice set of tables you can query any which way you want.
I use SQLite as an intermediate step between text files and static HTML, for example.
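That SQLite-to-static-HTML step can be tiny; this toy version (invented table and markup, not the commenter's actual pipeline) shows the shape:

```python
import html
import sqlite3

# Stand-in for a database already populated from text files.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posts (title TEXT, body TEXT)")
con.executemany("INSERT INTO posts VALUES (?, ?)",
                [("B post", "second"), ("A post", "first")])

# Query once, render once: the database does the sorting.
rows = con.execute("SELECT title, body FROM posts ORDER BY title").fetchall()
page = "<ul>\n" + "\n".join(
    f"  <li><b>{html.escape(t)}</b>: {html.escape(b)}</li>" for t, b in rows
) + "\n</ul>"
print(page)
```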
> The SQLite file format is cross-platform. A database file written on one machine can be copied to and used on a different machine with a different architecture. Big-endian or little-endian, 32-bit or 64-bit does not matter. All machines use the same file format. Furthermore, the developers have pledged to keep the file format stable and backwards compatible, so newer versions of SQLite can read and write older database files.
Our current setup is having all our services in kubernetes but our databases in stateful VMs. I do occasionally stuff job-reports and similar data into postgres rows since it's already there, but I've been unhappy with our ETL setup and would be interested in hearing techniques to improve it.
I'm in favor of leveraging ISP DBaaS and persistence offerings over trying to home-grow something. It just depends on where you are coming from and/or what you are trying to do... K8s alone avoids so much lock-in, and as long as whatever storage option (container mount) or DBaaS you use is portable, I don't think it's so bad in either case.
ETL = extract, transform, load
"... causes them to accumulate output as Comma-Separated-Values (CSV) in a temporary file, then invoke the default system utility for viewing CSV files (usually a spreadsheet program) on the result. This is a quick way of sending the result of a query to a spreadsheet for easy viewing"