
Ask HN: Best way to do heavy csv processing? - laxentasken
I've got a couple of big CSV files (~5-10 GB, millions of rows) that need to be processed (linked to older files, updating the data, etc.) and then exported to new CSV files.

The data follows the relational model, but just updating one field after dumping it into PostgreSQL takes quite some time (doing an update on join), and I'm not sure this is the most effective tool/way for this kind of work. The only queries that will be run are updates and inserts/appends of new data to existing tables (e.g. older files).

Do you have any suggestions for a workload like this?
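(A minimal sketch of the update-on-join pattern described above, using SQLite and made-up `master`/`delta` table and column names; PostgreSQL would usually write the same update as `UPDATE master SET val = d.val FROM delta d WHERE d.id = master.id`.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE master (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE delta  (id INTEGER PRIMARY KEY, val TEXT);
INSERT INTO master VALUES (1, 'old'), (2, 'old');
INSERT INTO delta  VALUES (2, 'new');
""")

# Update-on-join: overwrite master rows that have a matching delta row.
# Written as a correlated subquery so it works on any SQLite version.
conn.execute("""
UPDATE master
SET val = (SELECT d.val FROM delta d WHERE d.id = master.id)
WHERE EXISTS (SELECT 1 FROM delta d WHERE d.id = master.id)
""")

print(conn.execute("SELECT id, val FROM master ORDER BY id").fetchall())
# → [(1, 'old'), (2, 'new')]
```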
======
cypherdtraitor
I have done this several times now. CSV just plain sucks.

1. Rip all CSV data into SQLite or another tabular database

2. Do all data manipulations by shifting information between the database and
memory. Ideally you pull entire columns at a time. 95% of your runtime is
going to be spent pulling and pushing data, so minimize the number of calls
wherever possible.

3. Export the database to CSV and ship it back to the customer.
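A minimal sketch of steps 1 and 3 with Python's stdlib `csv` and `sqlite3` (the file paths, table name, and column layout are made up for illustration):

```python
import csv
import sqlite3

def csv_to_sqlite(csv_path: str, db_path: str, table: str) -> None:
    """Step 1: rip a CSV file into a SQLite table."""
    conn = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        ph = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        # Per step 2's advice: one big batched call, not one INSERT per row.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({ph})', reader)
    conn.commit()
    conn.close()

def sqlite_to_csv(db_path: str, table: str, csv_path: str) -> None:
    """Step 3: export the (updated) table back out as CSV."""
    conn = sqlite3.connect(db_path)
    cur = conn.execute(f'SELECT * FROM "{table}"')
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([d[0] for d in cur.description])
        writer.writerows(cur)
    conn.close()
```

Step 2 (the actual updates) then happens entirely in SQL between the two calls.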

If you use a particular language a lot, it is worth writing a text scanner
that uses low-level APIs to read large CSV files quickly. I usually pipe a
million characters at a time, submit most of them to the database, then duct-
tape the last few characters onto the next million that I pull.
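The chunked-read trick described above might look like this in Python. The million-character chunks and the carried-over partial line are the relevant parts; a real CSV scanner would also have to respect quoted fields that contain newlines:

```python
def iter_lines_chunked(path: str, chunk_size: int = 1_000_000):
    """Yield complete lines from a large file, reading chunk_size
    characters at a time and "duct-taping" the trailing partial
    line onto the next chunk."""
    leftover = ""
    with open(path, encoding="utf-8", newline="") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            lines = chunk.split("\n")
            leftover = lines.pop()  # may be a partial line
            yield from lines
    if leftover:
        yield leftover  # final line with no trailing newline
```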

------
jondegenhardt
Not a specific approach to your problem, but a resource that may be useful is
[https://github.com/dbohdan/structured-text-tools](https://github.com/dbohdan/structured-text-tools).
It lists an extensive set of command-line tools useful for working with this
class of files.

------
geophile
This is not a lot of data. Without trying hard, you should be able to import
the data within an hour. After that, it's basic postgres optimization. But
it's hard to offer advice with such a vague description of the update you are
attempting.

~~~
laxentasken
Yeah, sorry about that. Well, the file will only contain changes that need to
be reflected in the old file (which serves as a master). Usually the file
contains around 120 fields which then need to be updated. Besides that, I will
need to update around 150 other fields that serve as descriptions of the
original fields.

After that there are some appends of data to be done, which usually amount to
a couple of million rows.
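For appends of that size, wrapping the whole batch in a single transaction matters more than anything else. A sketch with `sqlite3` (in PostgreSQL the equivalent would be `COPY`, which is faster still); the table name and schema here are made up:

```python
import itertools
import sqlite3

def bulk_append(conn: sqlite3.Connection, table: str, rows) -> None:
    """Append an iterable of rows in a single transaction -- for a
    couple of million rows this is far faster than autocommitting
    one INSERT at a time."""
    it = iter(rows)
    first = next(it, None)
    if first is None:
        return  # nothing to append
    placeholders = ", ".join("?" for _ in first)
    with conn:  # one transaction around the whole batch
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            itertools.chain([first], it),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
bulk_append(conn, "t", ((i, "x") for i in range(100_000)))
```

Because `bulk_append` takes any iterable, the rows can be streamed straight from a `csv.reader` without materializing them in memory.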

