i would seriously consider sqlite-utils here.

https://sqlite-utils.datasette.io/en/stable/cli.html
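
For a taste of what that looks like, here's a minimal sketch using the Python API (the file and table names are made up; the sqlite-utils insert command does the equivalent from the command line):

    import json
    import sqlite_utils

    # made-up file and table names; the table schema is inferred
    # from the keys of the JSON objects
    db = sqlite_utils.Database("data.db")
    with open("data.json") as f:
        # note: json.load pulls the whole file into memory
        db["items"].insert_all(json.load(f))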




Was going to post the same thing - I suspect converting the dataset to a SQLite DB would be infinitely faster and more productive than pecking away at it with pandas and such.


At this size, I doubt it. While SQLite can read JSON if compiled with support for it, it stores it as TEXT (a rough sketch of what querying that looks like is below the links). The only native indexing I'm aware of for that is full-text search, and I suspect the cardinality of JSON characters would make that inefficient. Not to mention that the author stated they didn't have enough memory to store the entire file, so with a DB you'd be reading from disk.

MySQL or Postgres with their native JSON datatypes _might_ be faster, but you still have to load it in, and storing/indexing it in either of those is [0] its own [1] special nightmare full of footguns.

Having done similar text manipulation and searches on giant CSV files, I've found that parallel and xsv [2] are the way to go.

[0]: https://dev.mysql.com/doc/refman/8.0/en/json.html

[1]: https://www.postgresql.org/docs/current/datatype-json.html

[2]: https://github.com/BurntSushi/xsv
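
For concreteness, here's roughly what querying that TEXT column with the JSON functions looks like (events and doc are made-up names; without an index this is a full table scan):

    import sqlite3

    conn = sqlite3.connect("data.db")
    # json_extract operates on a plain TEXT column holding JSON text
    rows = conn.execute(
        """
        SELECT json_extract(doc, '$.user') AS user
        FROM events
        WHERE json_extract(doc, '$.status') = 'error'
        """
    ).fetchall()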


The trick sqlite-utils uses is to automatically create a SQLite table schema that matches the shape of the JSON object - specifically, if you give it a JSON array of objects.

It doesn't handle streaming JSON out of the box though, so you'd need to write some custom code on top of something like ijson to avoid loading the entire JSON file into memory first.
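
Something along these lines should work (an untested sketch - it assumes the file is one big top-level JSON array, and data.db / items are placeholder names):

    import ijson
    import sqlite_utils

    db = sqlite_utils.Database("data.db")

    def rows(path):
        # ijson yields one object at a time, so the whole array
        # never has to sit in memory
        with open(path, "rb") as f:
            yield from ijson.items(f, "item")

    # insert_all accepts any iterable and writes in batches
    db["items"].insert_all(rows("data.json"), batch_size=1000)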


SQLite is great for datasets that fit comfortably into memory, but otherwise it starts to struggle.


Come again? It is a database. Its entire raison d'être is managing bigger-than-memory data. On the SQLite forum[0], Dr. Hipp mentioned that a customer approached them who was using a ~40 TB SQLite database. Now, you could argue it is less good at that scale, particularly for analytical workloads, but SQLite has an enormous amount of headroom that you are unlikely to exhaust.

[0]: https://sqlite.org/forum/info/6e13e1911f1fe39c


That's not accurate: SQLite works great with data that's too big to fit in memory, provided you create indexes and write queries that can take advantage of them - just like any other relational database (MySQL, PostgreSQL, etc.).
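
A rough sketch of what that means in practice (table and column names made up): one pass builds the index, and lookups on that column afterwards no longer scan every row.

    import sqlite3

    conn = sqlite3.connect("data.db")
    # one pass over the table to build the index
    conn.execute("CREATE INDEX IF NOT EXISTS idx_items_status ON items(status)")

    # this query can now use the index instead of a full table scan
    errors = conn.execute(
        "SELECT * FROM items WHERE status = ?", ("error",)
    ).fetchall()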


I guess it depends on what you’re trying to do. Creating those indices is going to require a scan through the whole data set, and you’re going to need to run a fair number of queries to amortize that. And if I’m going to be querying this data over and over again, that raises the question of why it wasn’t in a relational database to begin with.

In my personal experience, I'm usually digging through structured logs to answer one or two questions, after which I won't need the exact same data set indexed the exact same way again. That's often more easily done by converting the data to TSV and using awk and other command-line tools, which is typically quicker and more parallelizable than loading the whole works into SQLite and doing the work there.


happily, i have multiple gigabytes of memory …


Sure, but a 40 GB SQLite database on a machine with 16 GB of RAM is not gonna be happy


You're not going to do better with pandas or similar tools. If it can't fit in memory, it's going to be painful. SQLite is the least painful in my experience, and it sets you up for working with the data in a proper DB like postgres or similar for when you get fed up with the memory constraints.


I wouldn’t use pandas in that situation either.


In the original article, they were trying to read a 10 GB file into 30 GB of RAM.



