You Shouldn't Use SQLite (hendrik-erz.de)
5 points by Existenceblinks on April 28, 2022 | 11 comments



> would be to store about 16GB of research data efficiently since it was becoming extremely slow to work with on her own computer. Almost immediately the idea was brought up to simply drop everything into an SQLite database. And to that I immediately had to respond “No!”

> The data will now (probably) be transferred to a sharded MySQL database cluster and that will speed up almost anything by large margins.

A sharded MySQL cluster for 16GB of data?!!


Since we're already at the point of giving advice: use Postgres instead of MySQL.

Anyway, the article is actually not very good. If you are able to put your data in a managed db with backups and so on, that is usually better than playing around with some local data, but plenty of people don't have access to a managed db service.

And no, 16 GB is nothing. Go get more RAM for 50 bucks and forget about it.


> Since SQLite files are single, continuous files on your file system, they have to be stored on one, physical hard drive.

But you can use things like RAID or LVM to have a filesystem bigger than any of the physical hard drives that it's spread across.

> File systems have been built with reasonable file sizes in mind. That means: The limit for the files are even smaller than the limits for hard drive sizes.

Btrfs and exFAT both support files of up to 16 EiB (16,777,216 TiB). No hard drives even come close to this size. For comparison, the total capacity of an AWS Snowmobile (a shipping container full of hard drives) is only about 0.1 EiB.


Author needs more education/experience on setting up and using a db.

Headline writers need more nuance.


So, they know they have the wrong index, but they don't know how to CREATE INDEX?


It's a strange article, so I posted it here to find out what the actual problem is. If they search on a non-indexed column, it's very likely to do a table scan.


And the cost of that table scan could have been reduced if the table they had to scan over didn't contain the large text column which has the contents of their files.
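
A minimal sketch of that idea (table and column names are made up): keep the big text column in its own table, so a scan or index over the metadata never has to wade through the file contents.

    import sqlite3

    con = sqlite3.connect("records.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS documents (
        id       INTEGER PRIMARY KEY,
        filename TEXT NOT NULL UNIQUE,  -- small metadata, cheap to scan or index
        date     TEXT
    );
    CREATE TABLE IF NOT EXISTS contents (
        doc_id INTEGER PRIMARY KEY REFERENCES documents(id),
        body   TEXT                     -- the large speech text lives only here
    );
    """)

    # A filename lookup now only touches the small `documents` table;
    # the body is fetched by primary key once the row is found.
    row = con.execute(
        "SELECT c.body FROM documents d JOIN contents c ON c.doc_id = d.id"
        " WHERE d.filename = ?",
        ("CREC-2021-05-01.txt",),  # example filename, made up
    ).fetchone()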


Why can't you just put an index on the filename?

I've never worked with that much data myself, but a lot of people seem to be doing fine with 10+ GB sqlite databases.
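
For what it's worth, adding that index is a one-liner. A minimal sketch, assuming a table named `files` with a `filename` column (names made up):

    import sqlite3

    con = sqlite3.connect("data.db")
    con.execute("CREATE TABLE IF NOT EXISTS files (filename TEXT, content TEXT)")
    con.execute("CREATE INDEX IF NOT EXISTS idx_files_filename ON files (filename)")
    con.commit()

    # EXPLAIN QUERY PLAN should now report a SEARCH using the index
    # instead of a SCAN over the whole table.
    print(con.execute(
        "EXPLAIN QUERY PLAN SELECT content FROM files WHERE filename = ?",
        ("example.txt",),
    ).fetchall())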


> I work with the Congressional Records dataset, that is: Everything that has been said or done on the floor of the U.S. Congress since 1873.

I guess this is probably one of the big cases for not using SQLite, as your db is more a record of texts than a columnar db. Or am I missing something?


[2021]

I have a few concerns about this article. First, a summary.

This is the event which raised the author's concern:

> Caifan Du asked what the best way would be to store about 16GB of research data efficiently since it was becoming extremely slow to work with on her own computer. Almost immediately the idea was brought up to simply drop everything into an SQLite database.

The author raises two points. First, he notes that SQLite's purported max DB size (281 TB) is not worth talking about because there is no single hard disk big enough to hold 281 TB. Next, he relates an experience: He downloaded ~600,000 text files of the Congressional Record (https://www.congress.gov/congressional-record), for the period from 1994 up to & including May 2021, into a single folder on his (presumably macOS) system. Opening the folder caused the Finder to hang. He then stored all of the files into a single SQLite database.

His code, which previously accessed files in the directory, now takes many times longer to run when it accesses 'files' through the SQLite database. It turned out that the SQLite database was being queried by filename (which was one of the columns), but that column was not indexed. Since there was no index, SQLite was doing a full table scan on every 'open file' query.

I would like to respond to both points, since this is at least partially in my wheelhouse (thanks to my day job).

Mr. Erz, you are correct that there is no single drive that can hold 281 TB. But, even with macOS, it would be possible to directly connect enough storage to hold 281 TB. For example, starting with a Mac with Thunderbolt 3, you could…

• Connect via Thunderbolt to an ATTO ThunderLink SH 3128 SAS adapter.

• Connect via two SAS cables to a Colfax CX42412s-JBOD 24-drive JBOD, containing 24 HGST 0F38357 16-TB SAS drives.

You can then use macOS RAID Assistant to format the drives as RAID 0, giving you ~384 TB raw capacity, or ~349 TiB raw capacity, which should be enough to hold 281 TB.

I should say, it would be very unlikely that someone would do this. Instead, that JBOD would probably be connected to a server, which would then serve those ~349 TiB (raw) to your computer over a network protocol like SMB. There are also network filesystems like Lustre, which use clusters of servers, each with its own direct-attached storage. In such an environment, software running on the clients takes commands like `list directory`, connects to and queries the relevant servers, and returns the results. My colleague Stéphane runs a service (Oak, https://uit.stanford.edu/service/oak-storage) which uses Lustre and has a capacity of many petabytes.

There may be other reasons why having a single 281 TB SQLite database is a bad idea, but that is out of scope in this case.

Next, a side comment on storing 600,000 files in a single directory. On the systems I run at work, such a directory would also take a long time to list. That is common for most environments where the `list contents of directory` operation is synchronous; the client (you) will have to wait for the OS to gather the information and organize it. Languages like Python also see this issue; it is the reason why `os.scandir` is often better to use than `os.listdir` (see https://stackoverflow.com/questions/59749854/how-does-os-lis...).
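
(A quick illustration of the `os.scandir` point: each `DirEntry` carries cached file-type information, so you can filter a huge directory without an extra `stat()` call per entry. The directory name here is just an example.)

    import os

    with os.scandir("congressional_record") as entries:
        txt_files = [e.name for e in entries if e.is_file() and e.name.endswith(".txt")]

    print(len(txt_files), "text files found")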

Finally, storing 'files' into a SQLite database. SQLite actually ships a tool for making "SQLite archive" files, which are what you describe: a SQLite database with a single table, containing each file's contents in a column. The `sqlar` schema is described at https://www.sqlite.org/sqlar.html, and it addresses the concern you raised in your article: it has the filename as the primary key. That does not eliminate the integer row ID, but it does automatically create an index on the filename; it also ensures that filenames are unique.

`sqlar` also has another benefit, something you wanted: the file contents are stored in compressed form. The database itself is not compressed as a whole, but each stored file's contents are.
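
In fact, a sqlar-style archive can be written from plain Python with nothing more than `sqlite3` and `zlib`. This is only a sketch of the schema documented at the link above (the example filename is made up); the official `sqlite3` archive tooling is the canonical implementation.

    import os
    import sqlite3
    import zlib

    con = sqlite3.connect("record.sqlar")
    con.execute("""
    CREATE TABLE IF NOT EXISTS sqlar (
        name  TEXT PRIMARY KEY,  -- filename; the PK gives you the index for free
        mode  INT,               -- access permissions
        mtime INT,               -- last modification time
        sz    INT,               -- original (uncompressed) size
        data  BLOB               -- zlib-compressed content
    )
    """)

    def add_file(path):
        raw = open(path, "rb").read()
        compressed = zlib.compress(raw, 9)
        st = os.stat(path)
        # sqlar keeps the data uncompressed when compression does not help.
        data = compressed if len(compressed) < len(raw) else raw
        con.execute(
            "REPLACE INTO sqlar (name, mode, mtime, sz, data) VALUES (?, ?, ?, ?, ?)",
            (path, st.st_mode, int(st.st_mtime), len(raw), data),
        )

    add_file("CREC-2021-05-01.txt")  # example filename, made up
    con.commit()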

In summary, you are correct that no single hard drive would store so much data, but even on macOS it is (theoretically) possible to have that much data directly connected, and it is practically possible with a SMB connection to a storage server, not to mention network file system platforms like Lustre. As for SQLite, the problem of not indexing is a vexing one, and a problem that database developers will always encounter (cf https://news.ycombinator.com/item?id=31170370; for this context, sharding may be thought of as another form of indexing). But please do not "throw the baby out with the bath water". The SQLite team themselves document and implement a schema you could have used. I ask you to reëvaluate your dislike of SQLite, and if you have some time, maybe try again, using the `sqlar` schema this time :-)

~ Karl

N.B. You may also be thrilled to learn that pandas now supports reading a SQL query into a DataFrame: https://pandas.pydata.org/pandas-docs/version/0.15.0/generat...
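
For example (a sketch; table and column names are made up):

    import sqlite3
    import pandas as pd

    con = sqlite3.connect("records.db")
    df = pd.read_sql_query(
        "SELECT filename, LENGTH(content) AS size_bytes FROM files"
        " ORDER BY size_bytes DESC LIMIT 10",
        con,
    )
    print(df)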


Especially for app configuration



