Hacker Newsnew | comments | leaders | jobs | submitlogin
The 5 lines that mystified O’Reilly - how to use a spreadsheet in bash (venturecake.com)
6 points by nailer 502 days ago | 10 comments


3 points by ibsulon 502 days ago | link

To parse a single CSV file with well defined rules is easy. However, different spreadsheet vendors handle escaped characters and various issues differently. http://www.rtraction.com/blog/devit/excel-csv-to-mysql-hell.... speaks of one problem that can occur. If you're trying to give example code that will work with a random CSV in the wild, it's going to be more than five lines. I'm guessing such a discussion was probably edited out as non-germane.

-----

2 points by jcl 502 days ago | link

Heh... Looks like he didn't test against a spreadsheet that contains a value with a comma in it. And cells with carriage returns, although rarer, would also parse incorrectly. The O'Reilly book is correct that the Python/Perl library would cope better.

The reason the O'Reilly book says that it is difficult to define CSV is because many software authors assumed that exporting to CSV is easy (like this article), so they implemented the finer points in slightly different, incompatible ways.

-----

1 point by nailer 502 days ago | link

The simplest way to use a spreadsheet with a value with commas, is to pick a different delimiter - eg, '%' or ':'.

Then change the SEPARATOR variable in the included scripts accordingly.

Cells with carriage returns would indeed be a problem but as you mentioned, they're quite unlikely.

-----

1 point by jcl 502 days ago | link

I can see how picking a different delimiter character is a simple solution, but it's not practical... The point of CSV is that it is easy to export directly from a spreadsheet program, so that you can handle large amounts of data automatically. If you have to scour your spreadsheet each time to make sure your chosen delimiter is not present in the data, this defeats the purpose of an automated approach. (And some spreadsheet programs will not let you choose the export delimiter. And your code will still not handle, say, a cell with a quotation mark in it.)

It's OK for a quick-and-dirty solution and a limited amount of data, but it's not much harder to write a solution that correctly handles all input using the Python library.

-----

1 point by nailer 502 days ago | link

> If you have to scour your spreadsheet each time to make sure your chosen delimiter is not present in the data, this defeats the purpose of an automated approach.

'scour' is a little exaggerated, don't you think?

Most data doesn't have commas in it, if you think it might, use a different delimiter.

> It's not much harder to write a solution that correctly handles all input using the Python library.

Sure, you can use Python, but I'm writing the article for Bash users - people getting into scripting for the first time - and the solution works reliably.

Edit: removed the '99% of cases' as provided there is at least one character that does not appear in your data (eg, %, #, etc) the solution simply works - and otherwise you have an extreme edge-case scenario.

-----

2 points by scott_s 502 days ago | link

"The solution works reliably for 99% of cases" is another way of saying the solution is not reliable. It's the 1% cases that make the problem harder, and that's what the author was referring to.

edit: As a pedagogical exercise it's fine, but I wouldn't rely on it in practice.

-----

1 point by nailer 502 days ago | link

> "The solution works reliably for 99% of cases" is another way of saying the solution is not reliable.

No, it's a way of saying the solution does not work in extreme edge case scenarios, specifically where there is no single character that in usused in your content and therefore no delimiter available.

-----

2 points by scott_s 502 days ago | link

Except I think that situation is more likely than you think.

-----

1 point by jcl 502 days ago | link

'scour' is a little exaggerated, don't you think?

Maybe a little, but it is an extra manual step you'd need to take each time you export CSV data, simply because the bash parser isn't equipped to handle it. And, as I said, you'd also need to check for quotation marks in the data -- Excel escapes them by doubling them up -- which the bash parser also won't handle as expected. These are not exactly uncommon characters in text data -- hardly extreme edge cases.

...Not that I mean to harp on the issue, but you did describe O'Reilly's statement that parsing CSV is "harder than it sounds" as "complete and utter horseshit". In light of the article's oversights, that strikes me as a little exaggerated. ;)

-----

1 point by nailer 502 days ago | link

We're talking Unix config options and command line parameters. Having a quote in the middle of such a thing is, for me, unheard of.

Having a quite around them is common, and shown inside the second example for this reason.

I think O'Reilly should provide a solution that works for their audience (primarily system administrators, judging from the books other contents) if they have a chapter on the topic. Preferably in Bash, as that's what the book is focused on.

Mentioning specific parsing tools built into Python would also also be of use (other chapters in the same book do pull out other languages if deemed necessary).

This chapter stands out as it exists in the TOC and has no code at all, when a very simple solution will handle common admin tasks easily.

-----




Lists | RSS | Bookmarklet | Guidelines | FAQ | News News | Feature Requests | Y Combinator | Apply | Library

Analytics by Mixpanel