Thank you for linking this. It seems much more intuitive than awk, especially for this particular purpose. I wish I had found it a few months ago when I was slicing and dicing lots of data for an extensive system migration project.
nice.
A long time ago I had much success importing and filtering multiple gigabytes of CSV data into Elasticsearch using CSVfix (if I'm not mistaken) plus jq, converting it to JSON Lines in between.
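I no longer have the exact csvfix/jq incantation, but the in-between step was essentially "CSV in, JSON Lines out", which in Python would look roughly like this sketch (file names are placeholders, not the real pipeline):

```python
import csv
import json

# Rough sketch of the CSV -> JSON Lines step: stream the CSV and emit one
# JSON object per line, which Elasticsearch bulk loaders happily accept.
# "input.csv" / "output.jsonl" are placeholder names.
with open("input.csv", newline="", encoding="utf-8") as src, \
        open("output.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):        # header row becomes the keys
        dst.write(json.dumps(row) + "\n")  # one document per line
```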
xsv seems to cover some areas I've used jq.
No mention of vi or SQLite? While I'm no vi expert, it's a great tool for browsing around big files without reaching for grep. And SQLite is similarly ubiquitous and capable of crunching large files.
vi is an editor, so it doesn't really solve my requirement of getting spreadsheet-like editing capabilities. SQLite is a good idea; I never thought of that. I will investigate it and add it to the article. Thanks!
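For reference, the SQLite route can be as simple as the sketch below; the file name, table name, and query here are placeholders, and the sqlite3 command-line shell can do the same import with .mode csv and .import.

```python
import csv
import sqlite3

# Minimal sketch: load a CSV into an on-disk SQLite database once, then
# filter/sort/aggregate it with plain SQL. Names here are placeholders.
con = sqlite3.connect("data.db")
with open("large.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    cols = ", ".join(f'"{h}"' for h in header)
    marks = ", ".join("?" for _ in header)
    con.execute(f"CREATE TABLE IF NOT EXISTS csv_rows ({cols})")
    con.executemany(f"INSERT INTO csv_rows VALUES ({marks})", reader)
    con.commit()

for row in con.execute("SELECT * FROM csv_rows LIMIT 10"):
    print(row)
```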
> Excel for Mac performed well but it is a paid solution so I did not consider it viable for many developers
Because developers handling gigabyte-sized data, and wanting to reliably manipulate it in a GUI, cannot possibly be expected to pay/afford the $7/month to Microsoft.
That said, the recommended solution is probably the best option for developers, not just because it's free, but for the ability to run complex SQL statements and visualize the results.
If I were to edit this article, that'd be my takeaway: use tool X for snappy visualization of SQL queries, even on multi-gigabyte CSVs.
I wonder how well Table Tool [1] would perform with your large dataset?
It's an open-source CSV editor for Mac from the developer of Postico, my favorite PostgreSQL client for Mac [2].
Python + Jupyter, OK, but pandas actually reads everything at once, doesn't it? 100 MB is no problem, but bigger files could result in high swapping pressure.
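For what it's worth, read_csv can also stream the file in chunks instead of loading it whole; a rough sketch, with the file and column names made up:

```python
import pandas as pd

# pd.read_csv loads the whole file by default, but chunksize turns it into an
# iterator of DataFrames so the full file never sits in memory at once.
# "big.csv" and the "amount" column are placeholders.
total = 0
for chunk in pd.read_csv("big.csv", chunksize=100_000):
    total += chunk["amount"].sum()  # aggregate incrementally, chunk by chunk
print(total)
```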
I definitely agree that with this amount of data, you should move to a more programmatic way to handle it... pandas or R.
Keep in mind that pandas (and probably also R?) internally uses optimized structures based on numpy. So a 10 GB CSV, depending on the content, might end up with a much smaller memory footprint inside pandas.
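A small illustration of that point; the column names and dtypes below are invented, the idea is just that explicit dtypes (especially 'category' for repetitive strings) can shrink the in-memory size well below the on-disk size:

```python
import pandas as pd

# Narrow numeric dtypes and 'category' for low-cardinality string columns can
# make the DataFrame much smaller than the raw CSV. Names/dtypes are made up.
df = pd.read_csv(
    "big.csv",
    dtype={"country": "category", "status": "category", "count": "int32"},
)
print(df.memory_usage(deep=True).sum() / 1e6, "MB in memory")
```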
If you have a 10 GB CSV, I think you will be happy working with pandas locally, even on a laptop. If you move to CSV files of tens of GB, a cloud VM with corresponding memory might serve you well. If you need to handle big-data-scale CSVs (hundreds of GB or even >TB), a scalable parallel solution like Spark will be your thing.
Before you scale up, however, maybe your task allows you to pre-filter the data and reduce the amount by orders of magnitude... often, thinking the problem through reduces the amount of metal you need to throw at it.
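The pre-filter step can stay dumb and streaming; something along these lines (file names, the 'country' column, and the predicate are all placeholders):

```python
import csv

# Stream the big CSV once, keep only the rows you actually need, and write a
# much smaller file to do the real work on. All names here are placeholders.
with open("huge.csv", newline="", encoding="utf-8") as src, \
        open("huge_filtered.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["country"] == "DE":  # whatever predicate shrinks the data
            writer.writerow(row)
```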
I am not clear what "manipulate" means here -- what is the author trying to do with the comma-separated values? FWIW, I can accomplish CSV manipulation with a handful of Unix utilities -- sed, awk, cut -- and call it a day.
- comma separated, nothing escaped (crash when 1 column contains a comma)
- comma separated, quotes around all elements, quotes not escaped
- comma separated, double-quotes around all elements, double-quotes not escaped
- comma separated, quotes around some elements, quotes not escaped
- comma separated, double-quotes around some elements, double-quotes not escaped
- comma separated, quotes around all elements, quotes escaped (using '')
- comma separated, double-quotes around all elements, double-quotes escaped (using "")
- comma separated, quotes around some elements, quotes escaped (using '')
- comma separated, double-quotes around some elements, double-quotes escaped (using "")
- comma separated, quotes around all elements, quotes escaped (using \')
- comma separated, double-quotes around all elements, double-quotes escaped (using \")
- comma separated, quotes around some elements, quotes escaped (using \')
- comma separated, double-quotes around some elements, double-quotes escaped (using \")
And the Microsoft format (where comma == semi-colon):
- semi-colon separated, nothing escaped (crash when 1 column contains a semi-colon)
- semi-colon separated, quotes around all elements, quotes not escaped
- semi-colon separated, double-quotes around all elements, double-quotes not escaped
- semi-colon separated, quotes around some elements, quotes not escaped
- semi-colon separated, double-quotes around some elements, double-quotes not escaped
- semi-colon separated, quotes around all elements, quotes escaped (using '')
- semi-colon separated, double-quotes around all elements, double-quotes escaped (using "")
- semi-colon separated, quotes around some elements, quotes escaped (using '')
- semi-colon separated, double-quotes around some elements, double-quotes escaped (using "")
- semi-colon separated, quotes around all elements, quotes escaped (using \')
- semi-colon separated, double-quotes around all elements, double-quotes escaped (using \")
- semi-colon separated, quotes around some elements, quotes escaped (using \')
- semi-colon separated, double-quotes around some elements, double-quotes escaped (using \")
And I'm not even talking about weird custom CSV variants that support multi-line fields, for example, or any other "I want to fit a circle in a square" mentality.
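For what it's worth, most of these variants boil down to three knobs that any real CSV parser exposes: the delimiter, the quote character, and how quotes are escaped (doubled vs. backslash). An illustrative sketch with Python's csv module, using invented sample data:

```python
import csv
import io

# comma separated, double quotes around fields, doubled "" to escape (RFC 4180 style)
rfc = io.StringIO('name,note\n"Smith, J.","said ""hi"""\n')
print(list(csv.reader(rfc, delimiter=",", quotechar='"', doublequote=True)))

# semi-colon separated ("Microsoft format"), backslash-escaped quotes
ms = io.StringIO('name;note\n"Smith; J.";"said \\"hi\\""\n')
print(list(csv.reader(ms, delimiter=";", quotechar='"',
                      doublequote=False, escapechar="\\")))

# tab separated, which sidesteps most of the quoting mess
tsv = io.StringIO("name\tnote\nSmith J.\tsaid hi\n")
print(list(csv.reader(tsv, delimiter="\t")))
```

The hard part is never expressing the dialect; it's knowing which dialect a given file actually uses.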
I don't know why people don't simply create TSV (tab-separated) files. No character-escaping mess. MUCH easier to parse.
I've had pretty good experiences with both of those for editing the raw files. However, neither of them gives you capabilities like moving columns or filtering. But for basic editing or viewing, they definitely work.
There seem to be literally dozens of solutions to do read-only operations but very few to enable comfortable editing of the files in-place in a Unix / command line environment.
Seems like a real gap in the software ecosystem atm:
- fast
- no limit on file size
- spreadsheet style layout
- command line
- easily edit and update individual cell => save
I've tried VIM's CSV plugins many times and have never been satisfied.
@alecdibble Could you try LibreOffice Calc [1]? It's my daily driver on Linux; it mostly works well as an Excel replacement, but I'm interested to see how well it does on a Mac.
I don't believe so as it was a CSV export from Sequel Pro. I could be wrong, but the import worked fine using the same file truncated to the first few thousand rows.
Tad is absolutely great for this; I tested it for the exact use case mentioned. I had to file a ticket and wait for the developer to add an export-to-CSV function. Tad is built on SQLite and can filter/sort/pivot/aggregate and export the result, which is all my business team partners ever need. It can handle GB files; I didn't test with TB sizes.
edit:
One note after seeing the other comments based on CLI tools (xsv, sed, awk, etc.): the OP's use case is something that marketing/PM/business stakeholders need to be able to use themselves. My favorite tools are CLI-based, but that does not fly with business teams, so forget that option.
I personally am not a huge fan of awk, I've never built a great mental model around its syntax and it doesn't really solve the problem I was talking about, which is getting a spreadsheet-like editing experience. Thanks for bringing it up, I should definitely add it to the article.
For the quoted spreadsheet-like operations of filtering and rearranging, awk is perfect as a deferred editor. That the kludgy first step of your chosen solution ("First, you must create a CSV file contain only the first 10-20 lines of your large CSV file") isn't just `head very_large_nov_2019.csv > very_large_nov_2019_abridged.csv` seems to further indicate an unfamiliarity with the large set of built-in, battle-tested UNIX tools for dealing with text files.
The first tools I reach for when dealing with CSVs of these and larger magnitudes are less, cut, awk, etc. They also tend to be the last tools I end up needing.
How well do those tools work with arbitrary CSV files, e.g. ones containing line breaks or quotes in field data? I wasn't aware that they can actually parse CSV; I thought you have to assume things about the content that may not turn out to be true.
Every data processing task has to make assumptions about the well-formedness of its input. "Arbitrary CSV" is basically undefined; whether deviations are best dealt with by parsing, preprocessing, or different tools altogether depends on the source.
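To make the failure mode concrete: quoted fields may legally contain the delimiter and even newlines, which is exactly where line- and comma-oriented tools trip up while a real parser copes. A tiny self-contained demo with invented data:

```python
import csv
import io

# A quoted field can contain commas and newlines; csv.reader handles it,
# while naive line splitting sees a broken record.
data = 'id,comment\n1,"hello, world"\n2,"spans\ntwo lines"\n'
print(list(csv.reader(io.StringIO(data))))
# -> [['id', 'comment'], ['1', 'hello, world'], ['2', 'spans\ntwo lines']]
print(data.splitlines()[2])  # -> 2,"spans   (what a line-oriented tool sees)
```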
I'm always astounded that there doesn't seem to be a decent general-purpose CSV editor/viewer application. Excel is atrocious: it's always dog slow, and it mangles any CSV I've ever opened by trying to interpret the data and format it "smartly".
Having to build a table in a database and import the CSV into that feels a bit like hitting a house fly with a sledgehammer, but it's the most effective way I've seen.
It's definitely not a problem of resources, only of application architecture.
(Indeed, the right application is a DBMS, not a spreadsheet.)