
How To Open and Manipulate Large CSV Files On A Mac - alecdibble
https://www.alecdibble.com/blog/large-csvs-on-mac/
======
nine_k
The counter-intuitive part is that a 100MB file is considered large on a
machine with 8-16GB RAM.

It's definitely not a problem of resources, but of application architecture.

(Indeed, the right application is a DBMS, not a spreadsheet.)

------
mamcx
You could get interested in:

[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)

~~~
alecdibble
Thank you for linking this. It seems to be much more intuitive than awk,
especially for this particular purpose. I wish I had found this a few
months ago when I was slicing and dicing lots of data for an extensive system
migration project.

------
adouzzy
When I saw "large", I expected >10GB. And "big data" if it needs batch or
distributed processing.

------
paulryanrogers
No mention of vi or SQLite? While I'm no vi expert, it's a great tool for
working with big files when you want to browse around without grep. And SQLite
is similarly ubiquitous and capable of crunching large files.

~~~
alecdibble
vi is an editor, so it doesn't really solve my requirement of getting
spreadsheet-like editing capabilities. SQLite is a good idea, I never thought
of that. I will investigate it and add it to the article. Thanks!

~~~
gav
SQLite does a great job of importing CSV, and then you're able to use
something like "DB Browser for SQLite"[1] to browse the data.

I use something like this to fix up column names and import:

    
    
        # tidy the header names (slashes, parens, spaces -> underscores), then import
        sed -i '' -e '1 s/ \/ /_/g; 1 s/[() ]/_/g' "$csv_file"
        sqlite3 "$db_file" <<SQL
        .mode csv
        .import $csv_file $table_name
        SQL
    

[1] [https://sqlitebrowser.org/](https://sqlitebrowser.org/)
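If you'd rather script the import than use the CLI, the same idea works from
Python's stdlib `sqlite3` module. A minimal sketch; the table name `data` and
the sample rows are invented for illustration:

```python
import csv
import io
import sqlite3

# Invented sample CSV standing in for the real export file
csv_file = io.StringIO("id,name\n1,alice\n2,bob\n")

conn = sqlite3.connect(":memory:")
reader = csv.reader(csv_file)
header = next(reader)

# Create the table from the header row, then bulk-insert the data rows
conn.execute("CREATE TABLE data (%s)" % ", ".join(header))
conn.executemany(
    "INSERT INTO data VALUES (%s)" % ", ".join("?" for _ in header),
    reader,
)

row_count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(row_count)  # 2
```

Note that without declared column types everything lands as text, which is
usually fine for browsing.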

~~~
alecdibble
Amazing, I will definitely try this out. Thank you!

------
dbt00
Two of my favorite tools for this kind of thing:

[http://visidata.org/](http://visidata.org/)

And

[http://recs.pl/](http://recs.pl/)

------
throwGuardian
> Excel for Mac performed well but it is a paid solution so I did not consider
> it viable for many developers

Because developers handling gigabyte-size data, and wanting to reliably
manipulate it in a GUI, cannot possibly be expected to pay/afford $7/month
to Microsoft.

That said, the recommended solution is probably the best option for
developers, not just because it's free, but for the ability to run complex
SQL statements and visualize the results.

If I were to edit this article, that'd be my takeaway: use tool X for snappy
visualization of SQL queries, even on multi-gigabyte CSVs.

~~~
apta
> cannot possibly be expected to pay/afford the $7/month to Microsoft.

Especially after paying $2000+ for a mac.

------
lerigner
I wonder how well Table Tool [1] would perform with your large dataset. It's
an open-source CSV editor for Mac from the developer of Postico, my
favorite PostgreSQL client for Mac [2].

[1] [https://github.com/jakob/TableTool](https://github.com/jakob/TableTool)

[2] [https://eggerapps.at/postico/](https://eggerapps.at/postico/)

------
coverman
Python + Pandas

~~~
appleiigs
Python + Pandas + Jupyter Notebook/Lab

~~~
anst
Python + Jupyter is OK, but pandas actually reads everything at once, doesn't
it? 100MB is no problem, but bigger files could result in high swap pressure.
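For what it's worth, `read_csv` can also stream a file in fixed-size chunks
rather than loading it all. A minimal sketch; the tiny in-memory CSV and its
`value` column are invented for illustration:

```python
import io
import pandas as pd

# Invented tiny CSV standing in for a large file on disk
csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n")

# chunksize makes read_csv yield DataFrames of at most N rows each,
# so only one chunk is resident in memory at a time
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["value"].sum()

print(total)  # 15
```

Aggregations done chunk-by-chunk like this keep memory bounded regardless of
file size.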

~~~
cgufus
I definitely agree that with this amount of data, you should move to a more
programmatic way to handle it... pandas or R.

Keep in mind that pandas (and probably also R?) internally uses optimized
structures based on numpy. So a 10 GB csv, depending on the content, might end
up with a much smaller memory footprint inside pandas.

If you have a 10 GB csv, I think you will be happy working with pandas locally,
even on a laptop. If you go to csv files of tens of GB, a cloud VM with
corresponding memory might serve you well. If you need to handle big-data-
scale csvs (hundreds of GB or even >TB), a scalable parallel solution like
Spark will be your thing. Before you scale up, however, maybe your task allows
you to pre-filter the data and reduce the amount by orders of magnitude...
often, thinking the problem through reduces the amount of metal one needs to
throw at the problem.

------
kjpatel
I use xsv, a fast CSV command-line toolkit written in Rust.

[https://formulae.brew.sh/formula/xsv](https://formulae.brew.sh/formula/xsv)

[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)

~~~
ekianjo
You can use eBay's tools as well, such as
[https://github.com/eBay/tsv-utils](https://github.com/eBay/tsv-utils), after
converting the CSV file to TSV.

------
whydoyoucare
I am not clear what "manipulate" means here -- what is the author trying to do
with the comma separated values? FWIW, I can accomplish csv manipulation using
a handful of Unix utilities: sed, awk, cut, and call it a day.

~~~
zaphirplane
The text in a column can be quoted and inside the quotes there can be escaped
quotes or commas.

It requires a lot of sed/awk skill to merge two columns or delete a column,
if it's possible at all.
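For contrast, a parser that actually understands quoting makes this trivial.
A rough sketch with Python's stdlib `csv` module, merging two columns while
keeping a third; the column layout is invented for illustration:

```python
import csv
import io

# Invented sample with a quoted field containing a comma
src = io.StringIO('last,first,note\n"Smith, Jr.",John,x\n')
out = io.StringIO()

reader = csv.reader(src)
writer = csv.writer(out)

# Rewrite the header: merged name column plus the kept third column
header = next(reader)
writer.writerow(["name", header[2]])

for row in reader:
    # Merge the first two columns; the writer re-applies quoting as needed
    writer.writerow([f"{row[1]} {row[0]}", row[2]])

print(out.getvalue())
```

The embedded comma survives the round trip because the reader strips the
quotes and the writer puts them back only where needed.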

~~~
grumpy-cowboy
Yes, the "CSV road to hell":

\- comma separated, nothing escaped (crash when 1 column contains a comma)

\- comma separated, quotes around all elements, quotes not escaped

\- comma separated, double-quotes around all elements, double-quotes not
escaped

\- comma separated, quotes around some elements, quotes not escaped

\- comma separated, double-quotes around some elements, double-quotes not
escaped

\- comma separated, quotes around all elements, quotes escaped (using '')

\- comma separated, double-quotes around all elements, double-quotes escaped
(using "")

\- comma separated, quotes around some elements, quotes escaped (using '')

\- comma separated, double-quotes around some elements, double-quotes escaped
(using "")

\- comma separated, quotes around all elements, quotes escaped (using \')

\- comma separated, double-quotes around all elements, double-quotes escaped
(using \")

\- comma separated, quotes around some elements, quotes escaped (using \')

\- comma separated, double-quotes around some elements, double-quotes escaped
(using \")

And the Microsoft format (where comma == semi-colon):

\- semi-colon separated, nothing escaped (crash when 1 column contains a
semi-colon)

\- semi-colon separated, quotes around all elements, quotes not escaped

\- semi-colon separated, double-quotes around all elements, double-quotes not
escaped

\- semi-colon separated, quotes around some elements, quotes not escaped

\- semi-colon separated, double-quotes around some elements, double-quotes not
escaped

\- semi-colon separated, quotes around all elements, quotes escaped (using '')

\- semi-colon separated, double-quotes around all elements, double-quotes
escaped (using "")

\- semi-colon separated, quotes around some elements, quotes escaped (using
'')

\- semi-colon separated, double-quotes around some elements, double-quotes
escaped (using "")

\- semi-colon separated, quotes around all elements, quotes escaped (using \')

\- semi-colon separated, double-quotes around all elements, double-quotes
escaped (using \")

\- semi-colon separated, quotes around some elements, quotes escaped (using
\')

\- semi-colon separated, double-quotes around some elements, double-quotes
escaped (using \")

And I'm not talking about some weird custom CSV variants to support multi-
lines for example or any other "I want to fit a circle in a square" mentality.

I don't know why people don't simply create TSV (tab-separated) files. No
character-escaping mess. Much easier to parse.

EDIT: Formatting
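The tab-split point above really is a one-liner, with the caveat that fields
must then never contain literal tabs or newlines; the sample line is invented
for illustration:

```python
# A TSV line parses with a plain split: no quoting rules to handle,
# as long as fields never contain literal tabs or newlines
line = "Smith, Jr.\tJohn\t42\n"
fields = line.rstrip("\n").split("\t")
print(fields)  # ['Smith, Jr.', 'John', '42']
```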

------
70jS8h5L
[https://kothar.net/csview](https://kothar.net/csview)

A fast viewer a friend of mine created to view large CSVs in a GUI - might be
useful to someone.

------
GiorgioG
Sublime Text and VSCode work fine with 1GB files in my experience.

~~~
seltzered_
I think the article tries to go into further depth beyond what a text editor
can do on CSV files (filtering, sorting).

That said, worth mentioning CSV syntax highlighters for text editors. I’ve
found rainbow csv quite helpful
([https://github.com/mechatroner/sublime_rainbow_csv](https://github.com/mechatroner/sublime_rainbow_csv))

------
zmmmmm
There seem to be literally dozens of solutions to do read-only operations but
very few to enable comfortable editing of the files in-place in a Unix /
command line environment.

Seems like a real gap in the software ecosystem atm:

    
    
        - fast
        - no limit on file size
        - spreadsheet style layout
        - command line
        - easily edit and update individual cell => save
    

I've tried VIM's CSV plugins many times and have never been satisfied.

~~~
rabidrat
Have you tried VisiData? [https://visidata.org](https://visidata.org).

~~~
zmmmmm
Wow, that's exactly what I wanted - thanks!

------
bArray
@alecdibble Could you try LibreOffice's Calc [1]? It's my daily driver on
Linux; it mostly works well as an Excel replacement, but I'm interested to see
how well it does on a Mac.

[1]
[https://www.libreoffice.org/download/download/](https://www.libreoffice.org/download/download/)

~~~
xrftghvb
Maximum number of Rows per worksheet = 1 048 576 (2^20).

~~~
bArray
Apparently you should be able to go way above this [1].

[1] [https://stackoverflow.com/questions/41213869/where-to-find-the-base-module-of-libreoffice-that-has-no-row-limit](https://stackoverflow.com/questions/41213869/where-to-find-the-base-module-of-libreoffice-that-has-no-row-limit)

------
entee
Was SequelPro trying to impute any field types? That could have been causing
the crash when faced with a large file.

~~~
alecdibble
I don't believe so as it was a CSV export from Sequel Pro. I could be wrong,
but the import worked fine using the same file truncated to the first few
thousand rows.

------
banku_brougham
Tad is absolutely great for this; I tested it for the exact use case
mentioned. I had to file a ticket and wait for the author to add an
export-to-CSV function. Tad is built over SQLite and can
filter/sort/pivot/aggregate and export the result, which is all my business
team partners ever need. It can handle GB files; I didn't test with TB sizes.

[https://www.tadviewer.com/](https://www.tadviewer.com/), it's a desktop app

edit: one note after seeing other comments based on CLI tools (xsv, sed, awk,
etc.): the OP's use case is something that marketing/PM/business stakeholders
can use. My favorite tools are CLI-based; however, that does not fly with
business teams, so forget that option.

------
jwx48
As a huge fan of OpenRefine, I want to say thanks for introducing me to Sequel
Pro.

------
_Codemonkeyism
I've used Vim in the past to open and edit multi-GB SQL files without
problems.

------
thrax
Memory mapped file.. Intermediate binary representation..

------
1996
Why do you want to do that?

Use cat, pipe, grep, awk. Problem solved.

~~~
alecdibble
I personally am not a huge fan of awk; I've never built a great mental model
around its syntax, and it doesn't really solve the problem I was talking
about, which is getting a spreadsheet-like editing experience. Thanks for
bringing it up, I should definitely add it to the article.

~~~
bangonkeyboard
For the quoted spreadsheet-like operations of filtering and rearranging, awk
is perfect as a deferred editor. That the kludgy first step of your chosen
solution ("First, you must create a CSV file contain only the first 10-20
lines of your large CSV file") isn't just `head very_large_nov_2019.csv >
very_large_nov_2019_abridged.csv` seems to further indicate an unfamiliarity
with the large set of built-in, battle-tested UNIX tools for dealing with text
files.

The first tools I reach for when dealing with CSVs of these and larger
magnitudes are less, cut, awk, etc. They also tend to be the last tools I end
up needing.

~~~
ygra
How well do those tools work with arbitrary CSV files, e.g. ones containing
line breaks or quotes in field data? I wasn't aware that they could actually
parse CSV; instead you have to assume things about the content that may not
turn out to be true.

~~~
bangonkeyboard
Every data processing task has to make assumptions about the well-formedness
of its input. "Arbitrary CSV" is basically undefined; whether deviations are
best dealt with by parsing, preprocessing, or different tools altogether
depends on the source.

------
thrower123
I'm always astounded that there doesn't seem to be a decent general purpose
CSV editor/viewer application. Excel is atrocious - it's always dog slow, and
it mangles any CSV I've ever opened by trying to interpret the data to format
it "smartly".

Having to build a table in a database and import the CSV into that feels a bit
like hitting a house fly with a sledgehammer, but it's the most effective way
I've seen.

~~~
avmich
One can convert CSV into JSON in streaming fashion (a recipe is e.g. here:
[https://infiniteundo.com/post/99336704013/convert-csv-to-json-with-jq](https://infiniteundo.com/post/99336704013/convert-csv-to-json-with-jq))
and then use jq ([https://stedolan.github.io/jq/](https://stedolan.github.io/jq/))
to do what one needs.
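The streaming conversion can also be done with the Python stdlib alone,
emitting one JSON object per line (JSON Lines) for jq to consume; the sample
rows are invented for illustration:

```python
import csv
import io
import json

# Invented sample CSV; in practice this would be the large file on disk
src = io.StringIO("name,age\nalice,30\nbob,25\n")

# One JSON object per row, so downstream tools like jq can process
# the stream without ever holding the whole file in memory
jsonl = [json.dumps(row) for row in csv.DictReader(src)]
print("\n".join(jsonl))
```

Piped to `jq` as JSON Lines, each row can then be filtered or reshaped
independently.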

