i would seriously consider sqlite-utils here.

https://sqlite-utils.datasette.io/en/stable/cli.html
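
For a taste of what that looks like, here's a minimal sketch using the Python API (the file and table names are made up; the sqlite-utils insert command does the equivalent from the command line):

    import json
    import sqlite_utils

    # made-up file and table names; the table schema is inferred
    # from the keys of the JSON objects
    db = sqlite_utils.Database("data.db")
    with open("data.json") as f:
        # note: json.load pulls the whole file into memory
        db["items"].insert_all(json.load(f))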




Was going to post the same thing - I suspect converting the dataset to a SQLite DB would be infinitely faster and more productive than pecking away at it with pandas and such.


At this size, I doubt it. While SQLite can read JSON if compiled with support for it, it stores it as TEXT (a rough sketch of what querying that looks like is below the links). The only native indexing I'm aware of for that is full-text search, and I suspect the cardinality of JSON characters would make that inefficient. Not to mention that the author stated they didn't have enough memory to store the entire file, so with a DB you'd be reading from disk.

MySQL or Postgres with their native JSON datatypes _might_ be faster, but you still have to load it in, and storing/indexing it in either of those is [0] its own [1] special nightmare full of footguns.

Having done similar text manipulation and searches on giant CSV files, I've found that parallel and xsv [2] are the way to go.

[0]: https://dev.mysql.com/doc/refman/8.0/en/json.html

[1]: https://www.postgresql.org/docs/current/datatype-json.html

[2]: https://github.com/BurntSushi/xsv
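
For concreteness, here's roughly what querying that TEXT column with the JSON functions looks like (events and doc are made-up names; without an index this is a full table scan):

    import sqlite3

    conn = sqlite3.connect("data.db")
    # json_extract operates on a plain TEXT column holding JSON text
    rows = conn.execute(
        """
        SELECT json_extract(doc, '$.user') AS user
        FROM events
        WHERE json_extract(doc, '$.status') = 'error'
        """
    ).fetchall()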


The trick sqlite-utils uses is to automatically create a SQLite table schema that matches the shape of the JSON object - specifically, if you give it a JSON array of objects.

It doesn't handle streaming JSON out of the box though, so you'd need to write some custom code on top of something like ijson to avoid loading the entire JSON file into memory first.
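
Something along these lines should work (an untested sketch - it assumes the file is one big top-level JSON array, and data.db / items are placeholder names):

    import ijson
    import sqlite_utils

    db = sqlite_utils.Database("data.db")

    def rows(path):
        # ijson yields one object at a time, so the whole array
        # never has to sit in memory
        with open(path, "rb") as f:
            yield from ijson.items(f, "item")

    # insert_all accepts any iterable and writes in batches
    db["items"].insert_all(rows("data.json"), batch_size=1000)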


SQLite is great for datasets that fit comfortably into memory, but otherwise it starts to struggle.


Come again? It is a database. Its entire raison d'être is managing bigger-than-memory data. On the SQLite forum[0], Dr. Hipp mentioned that a customer approached them who was using a ~40 TB SQLite database. Now, you could argue it is less good at that scale, particularly for analytical workloads, but SQLite has an enormous amount of headroom that you are unlikely to exhaust.

[0]: https://sqlite.org/forum/info/6e13e1911f1fe39c


That's not accurate: SQLite works great with data that's too big to fit in memory, provided you create indexes and write queries that can take advantage of them - just like any other relational database (MySQL, PostgreSQL, etc.).
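
A rough sketch of what that means in practice (table and column names made up): one pass builds the index, and lookups on that column afterwards no longer scan every row.

    import sqlite3

    conn = sqlite3.connect("data.db")
    # one pass over the table to build the index
    conn.execute("CREATE INDEX IF NOT EXISTS idx_items_status ON items(status)")

    # this query can now use the index instead of a full table scan
    errors = conn.execute(
        "SELECT * FROM items WHERE status = ?", ("error",)
    ).fetchall()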


I guess it depends on what you’re trying to do. Creating those indices is going to require a scan through the whole data set, and you’re going to need to run a fair number of queries to amortize that. And if I’m going to be querying this data over and over again, that raises the question of why it wasn’t in a relational database to begin with.

In my personal experience, I'm usually digging through structured logs to answer one or two questions, after which I won't need the exact same data set indexed the exact same way again. That's often more easily done by converting the data to TSV and using awk and other command-line tools, which is typically quicker and more parallelizable than loading the whole works into SQLite and doing the work there.


happily, i have multiple gigabytes of memory …


Sure, but a 40 GB SQLite database on a machine with 16 GB of RAM is not gonna be happy


You're not going to do better with pandas or similar tools. If it can't fit in memory, it's going to be painful. SQLite is the least painful in my experience, and it sets you up for working with the data in a proper DB like postgres or similar for when you get fed up with the memory constraints.


I wouldn’t use pandas in that situation either.


In the original article, they were trying to read a 10 GB file into 30 GB of RAM.



