How To Open and Manipulate Large CSV Files On A Mac (alecdibble.com)
29 points by alecdibble on Nov 5, 2019 | 53 comments



The counter-intuitive part is that a 100MB file is considered large on a machine with 8-16GB RAM.

It's definitely not a problem of resources, but of application architecture.

(Indeed, the right application is a DBMS, not a spreadsheet.)


You might be interested in:

https://github.com/BurntSushi/xsv


Thank you for linking this. It seems to be much more intuitive than awk, especially for his particular purpose. I wish I had found this a few months ago when I was slicing and dicing lots of data for an extensive system migration project.


Nice. A long time ago I had much success importing and filtering multiple gigabytes of CSV data into Elasticsearch using CSVfix (if I'm not mistaken) plus jq, converting it to JSON Lines in between (using jq as well). xsv seems to cover some of the areas I've used jq for.


When I saw "large", I expected >10GB. And "big data" if it needs batch or distributed processing.


No mention of vi or SQLite? While I'm no vi expert, it's a great tool for working with big files when you want to browse around without grep. And SQLite is similarly ubiquitous and capable of crunching large files.


vi is an editor, so it doesn't really solve my requirement of getting spreadsheet-like editing capabilities. SQLite is a good idea, I never thought of that. I will investigate it and add it to the article. Thanks!


SQLite does a great job of importing CSV, and then you're able to use something like "DB Browser for SQLite"[1] to browse the data.

I use something like this to fix up column names and import:

    # Normalize the header row: turn " / " and any parens/spaces into underscores
    sed -i '' -e '1 s/ \/ /_/g; 1 s/[() ]/_/g' $csv_file
    # Import the cleaned CSV (.import creates the table from the header row if it doesn't exist)
    sqlite3 $db_file <<SQL
    .mode csv
    .import $csv_file $table_name
    SQL
[1] https://sqlitebrowser.org/
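
Once it's in SQLite you can also poke at it straight from the shell; a minimal sketch (the "status" column is hypothetical, just to show an ad-hoc aggregation):

    # Row count, then a quick group-by, printed with aligned columns
    sqlite3 $db_file "SELECT count(*) FROM $table_name;"
    sqlite3 -header -column $db_file \
      "SELECT status, count(*) AS n FROM $table_name GROUP BY status ORDER BY n DESC LIMIT 10;"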


Amazing, I will definitely try this out. Thank you!


Two of my favorite tools for this kind of thing:

http://visidata.org/

And

http://recs.pl/


> Excel for Mac performed well but it is a paid solution so I did not consider it viable for many developers

Because developers handling gigabyte-sized data, and wanting to reliably manipulate it in a GUI, cannot possibly be expected to pay/afford the $7/month to Microsoft.

That said, the recommended solution is probably the best option for developers, not because it's free, but for the ability to run complex SQL statements and visualize the results.

If I were to edit this article, that'd be my takeaway: use tool X for snappy visualization of SQL queries, even on multi-gigabyte CSVs.


> cannot possibly be expected to pay/afford the $7/month to Microsoft.

Especially after paying $2000+ for a mac.


I wonder how well Table Tool [1] would perform with your large dataset. It's an open source CSV editor for Mac from the developer of Postico, my favorite PostgreSQL client for Mac [2].

[1] https://github.com/jakob/TableTool

[2] https://eggerapps.at/postico/


Python + Pandas


Or R + Tidyverse will do the job nicely too.


R without tidyverse (which is just sugar) will do just as nicely.


Agree. But tidyverse just makes things a lot more elegant to write.


Python + Pandas + Jupyter Notebook/Lab


Python + Jupyter, OK, but pandas actually reads everything at once, doesn't it? 100MB is no problem, but bigger files could result in high swapping pressure.


I definitely agree that with this amount of data, you should move to a more programmatic way to handle it... pandas or R.

Keep in mind that pandas (and probably also R?) internally uses optimized structures based on numpy. So a 10 GB csv, depending on the content, might end up with a much smaller memory footprint inside pandas.

If you have a 10 GB csv, I think you will be happy working with pandas locally, even on a laptop. If you go to csv files with tens of GB, a cloud VM with corresponding memory might serve you well. If you need to handle big-data-scale csvs (hundreds of GB or even >TB), a scalable parallel solution like Spark will be your thing. Before you scale up, however, maybe your task allows you to pre-filter the data and reduce the amount by orders of magnitude... often, thinking the problem through reduces the amount of metal one needs to throw at the problem.
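
For that pre-filtering step, even plain shell tools go a long way before anything touches pandas; a rough sketch, where the file name and the "checkout" filter value are made up and the file is assumed to have no quoted commas:

    # Keep the header row, then only the rows containing the value of interest
    head -1 events.csv > events_checkout.csv
    grep ',checkout,' events.csv >> events_checkout.csv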


I use XSV: A fast CSV command line toolkit written in Rust.

https://formulae.brew.sh/formula/xsv

https://github.com/BurntSushi/xsv
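
A few of the subcommands I reach for most often (file and column names here are just placeholders):

    xsv headers big.csv                      # list the column names
    xsv count big.csv                        # number of records (quote-aware)
    xsv select name,email big.csv | head     # pull out a couple of columns
    xsv search -s country 'DE' big.csv \
      | xsv sort -s name \
      | xsv table | less -S                  # filter, sort, and view aligned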


You can use eBay's tools as well, such as https://github.com/eBay/tsv-utils, after converting the CSV file to TSV.
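
For example (a sketch with made-up file names and column positions; csv2tsv ships with the toolkit and handles CSV quoting during the conversion):

    # Convert to TSV, then keep rows where column 3 > 100 and select columns 1 and 3
    csv2tsv data.csv > data.tsv
    tsv-filter -H --gt 3:100 data.tsv | tsv-select -f 1,3 > filtered.tsv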


I am not clear what "manipulate" means here -- what is the author trying to do with the comma separated values? FWIW, I can accomplish csv manipulation using a handful of Unix utilities: sed, awk, cut, and call it a day.
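
For a straightforward file, that can be as simple as the following (column positions and values are invented, and it assumes no quoted commas):

    # Column 2's distinct values with counts, skipping the header
    cut -d, -f2 orders.csv | tail -n +2 | sort | uniq -c | sort -rn | head
    # Keep only rows whose 4th field equals "2019-11"
    awk -F, '$4 == "2019-11"' orders.csv > nov.csv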


The text in a column can be quoted and inside the quotes there can be escaped quotes or commas.

It requires a lot of sed/awk skill to merge two columns or delete a column, if it's possible at all.


Yes, the "CSV road to hell":

- comma separated, nothing escaped (crash when 1 column contains a comma)

- comma separated, quotes around all elements, quotes not escaped

- comma separated, double-quotes around all elements, double-quotes not escaped

- comma separated, quotes around some elements, quotes not escaped

- comma separated, double-quotes around some elements, double-quotes not escaped

- comma separated, quotes around all elements, quotes escaped (using '')

- comma separated, double-quotes around all elements, double-quotes escaped (using "")

- comma separated, quotes around some elements, quotes escaped (using '')

- comma separated, double-quotes around some elements, double-quotes escaped (using "")

- comma separated, quotes around all elements, quotes escaped (using \')

- comma separated, double-quotes around all elements, double-quotes escaped (using \")

- comma separated, quotes around some elements, quotes escaped (using \')

- comma separated, double-quotes around some elements, double-quotes escaped (using \")

And the Microsoft format (where comma == semi-colon):

- semi-colon separated, nothing escaped (crash when 1 column contains a comma)

- semi-colon separated, quotes around all elements, quotes not escaped

- semi-colon separated, double-quotes around all elements, double-quotes not escaped

- semi-colon separated, quotes around some elements, quotes not escaped

- semi-colon separated, double-quotes around some elements, double-quotes not escaped

- semi-colon separated, quotes around all elements, quotes escaped (using '')

- semi-colon separated, double-quotes around all elements, double-quotes escaped (using "")

- semi-colon separated, quotes around some elements, quotes escaped (using '')

- semi-colon separated, double-quotes around some elements, double-quotes escaped (using "")

- semi-colon separated, quotes around all elements, quotes escaped (using \')

- semi-colon separated, double-quotes around all elements, double-quotes escaped (using \")

- semi-colon separated, quotes around some elements, quotes escaped (using \')

- semi-colon separated, double-quotes around some elements, double-quotes escaped (using \")

And I'm not even talking about weird custom CSV variants that support multi-line values, for example, or any other "I want to fit a circle in a square" mentality.

I don't know why people don't simply create TSV (tab-separated) files. No character-escaping mess. MUCH easier to parse.

EDIT: Formatting


This is the best answer if the csv file is more or less standard.


awk, sed, cut, etc... are excellent as long as the format is regular and you only need to process the file once


https://kothar.net/csview

A fast viewer a friend of mine created to view large CSVs in a GUI - might be useful to someone.


Sublime Text and VSCode work fine with 1GB files in my experience.


I think the article tries to go into further depth beyond what a text editor can do on CSV files (filtering, sorting).

That said, worth mentioning CSV syntax highlighters for text editors. I’ve found rainbow csv quite helpful (https://github.com/mechatroner/sublime_rainbow_csv)


I've had pretty good experiences with both of those for editing the raw files. However, neither of them gives you capabilities like moving columns or filtering. But for basic editing or viewing, they definitely work.


There seem to be literally dozens of solutions to do read-only operations but very few to enable comfortable editing of the files in-place in a Unix / command line environment.

Seems like a real gap in the software ecosystem atm:

    - fast
    - no limit on file size
    - spreadsheet style layout
    - command line
    - easily edit and update individual cell => save
I've tried VIM's CSV plugins many times and have never been satisfied.


Have you tried VisiData? https://visidata.org.


Wow, that's exactly what I wanted - thanks!


@alecdibble Could you try LibreOffice's Calc [1]? It's my daily driver on Linux; it mostly works well as an Excel replacement, but I'm interested to see how well it does on a Mac.

[1] https://www.libreoffice.org/download/download/


Maximum number of Rows per worksheet = 1 048 576 (2^20).


Apparently you should be able to go way above this [1].

[1] https://stackoverflow.com/questions/41213869/where-to-find-t...


Was Sequel Pro trying to impute any field types? That could have been causing the crash when faced with a large file.


I don't believe so as it was a CSV export from Sequel Pro. I could be wrong, but the import worked fine using the same file truncated to the first few thousand rows.


Tad is absolutely great for this; I tested it for the exact use case mentioned. I had to file a ticket and wait for him to add an export-to-CSV function. Tad is built over SQLite and can filter/sort/pivot/aggregate and export the result, which is all my business team partners ever need. It can handle GB files; I didn't test with TB size.

https://www.tadviewer.com/, it's a desktop app

Edit: one note after seeing other comments based on CLI tools (xsv, sed, awk, etc.): the OP's use case is something that marketing/PM/business stakeholders can use. My favorite tools are CLI-based, however this does not fly with business teams, so forget that option.


As a huge fan of OpenRefine, I want to say thanks for introducing me to Sequel Pro.


Used VIM in the past to open several GB of SQL files to edit them without problems.


Memory mapped file.. Intermediate binary representation..


Why do you want to do that?

Use cat, pipe, grep, awk. Problem solved.


I personally am not a huge fan of awk; I've never built a great mental model around its syntax, and it doesn't really solve the problem I was talking about, which is getting a spreadsheet-like editing experience. Thanks for bringing it up, I should definitely add it to the article.


For the quoted spreadsheet-like operations of filtering and rearranging, awk is perfect as a deferred editor. That the kludgy first step of your chosen solution ("First, you must create a CSV file contain only the first 10-20 lines of your large CSV file") isn't just `head very_large_nov_2019.csv > very_large_nov_2019_abridged.csv` seems to further indicate an unfamiliarity with the large set of built-in, battle-tested UNIX tools for dealing with text files.

The first tools I reach for when dealing with CSVs of these and larger magnitudes are less, cut, awk, etc. They also tend to be the last tools I end up needing.
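
As a sketch of that "deferred editor" idea (the column numbers and threshold are invented, and it assumes no quoted commas), you can filter rows and reorder columns in one pass, then open the much smaller result wherever you like:

    # Keep the header plus rows where column 5 exceeds 1000, emitting columns 2, 1, 7 in that order
    awk -F, 'BEGIN { OFS = "," } NR == 1 || $5 > 1000 { print $2, $1, $7 }' \
        very_large_nov_2019.csv > filtered.csv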


How well do those tools work with arbitrary CSV files, e.g. containing line breaks or quotes in field data? I wasn't aware that they can actually parse CSV; instead you have to assume things about the content that may not end up being true.


Every data processing task has to make assumptions about the well-formedness of its input. "Arbitrary CSV" is basically undefined; whether deviations are best dealt with by parsing, preprocessing, or different tools altogether depends on the source.


I'm always astounded that there doesn't seem to be a decent general purpose CSV editor/viewer application. Excel is atrocious - it's always dog slow, and it mangles any CSV I've ever opened by trying to interpret the data to format it "smartly".

Having to build a table in a database and import the CSV into that feels a bit like hitting a house fly with a sledgehammer, but it's the most effective way I've seen.


One can convert CSV into JSON in a streaming fashion (a recipe is here, for example: https://infiniteundo.com/post/99336704013/convert-csv-to-jso... ) and then use jq ( https://stedolan.github.io/jq/ ) to do what one needs.
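
A minimal jq-only sketch of that idea, assuming a simple file with a header row and no quoted or embedded commas (anything messier needs a real CSV parser in front):

    # Emit one JSON object per data row, using the header row for the keys
    jq -R -s '
      split("\n") | map(select(length > 0))
      | (.[0] | split(",")) as $headers
      | .[1:][]
      | split(",") as $row
      | [$headers, $row] | transpose | map({ (.[0]): .[1] }) | add
    ' data.csv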


http://visidata.org/ is pretty fast if you're comfortable in a terminal (not CLI) app.


> and it mangles any CSV I've ever opened by trying to interpret the data to format it "smartly".

You should be using the Data -> From Text importer, not just double-clicking. Also use the latest 64-bit version of Excel.


There are csvkit and xsv, and I've seen a few "treat a CSV file as a database" projects around.
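
csvkit's csvsql is the most literal version of that: it loads the file into a temporary SQLite database and runs your query against it (the file and column names below are made up; the table name defaults to the file's base name):

    # Query a CSV directly with SQL; csvlook pretty-prints the result
    csvsql --query "SELECT country, count(*) AS n FROM orders GROUP BY country ORDER BY n DESC" \
        orders.csv | csvlook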



