Hacker News new | comments | show | ask | jobs | submit login
The 5 lines that mystified O’Reilly - how to use a spreadsheet in bash (venturecake.com)
6 points by nailer 3315 days ago | hide | past | web | 10 comments | favorite



To parse a single CSV file with well defined rules is easy. However, different spreadsheet vendors handle escaped characters and various issues differently. http://www.rtraction.com/blog/devit/excel-csv-to-mysql-hell.... speaks of one problem that can occur. If you're trying to give example code that will work with a random CSV in the wild, it's going to be more than five lines. I'm guessing such a discussion was probably edited out as non-germane.


Heh... Looks like he didn't test against a spreadsheet that contains a value with a comma in it. And cells with carriage returns, although rarer, would also parse incorrectly. The O'Reilly book is correct that the Python/Perl library would cope better.

The reason the O'Reilly book says that it is difficult to define CSV is because many software authors assumed that exporting to CSV is easy (like this article), so they implemented the finer points in slightly different, incompatible ways.


The simplest way to use a spreadsheet with a value with commas, is to pick a different delimiter - eg, '%' or ':'.

Then change the SEPARATOR variable in the included scripts accordingly.

Cells with carriage returns would indeed be a problem but as you mentioned, they're quite unlikely.


I can see how picking a different delimiter character is a simple solution, but it's not practical... The point of CSV is that it is easy to export directly from a spreadsheet program, so that you can handle large amounts of data automatically. If you have to scour your spreadsheet each time to make sure your chosen delimiter is not present in the data, this defeats the purpose of an automated approach. (And some spreadsheet programs will not let you choose the export delimiter. And your code will still not handle, say, a cell with a quotation mark in it.)

It's OK for a quick-and-dirty solution and a limited amount of data, but it's not much harder to write a solution that correctly handles all input using the Python library.


> If you have to scour your spreadsheet each time to make sure your chosen delimiter is not present in the data, this defeats the purpose of an automated approach.

'scour' is a little exaggerated, don't you think?

Most data doesn't have commas in it, if you think it might, use a different delimiter.

> It's not much harder to write a solution that correctly handles all input using the Python library.

Sure, you can use Python, but I'm writing the article for Bash users - people getting into scripting for the first time - and the solution works reliably.

Edit: removed the '99% of cases' as provided there is at least one character that does not appear in your data (eg, %, #, etc) the solution simply works - and otherwise you have an extreme edge-case scenario.


'scour' is a little exaggerated, don't you think?

Maybe a little, but it is an extra manual step you'd need to take each time you export CSV data, simply because the bash parser isn't equipped to handle it. And, as I said, you'd also need to check for quotation marks in the data -- Excel escapes them by doubling them up -- which the bash parser also won't handle as expected. These are not exactly uncommon characters in text data -- hardly extreme edge cases.

...Not that I mean to harp on the issue, but you did describe O'Reilly's statement that parsing CSV is "harder than it sounds" as "complete and utter horseshit". In light of the article's oversights, that strikes me as a little exaggerated. ;)


We're talking Unix config options and command line parameters. Having a quote in the middle of such a thing is, for me, unheard of.

Having a quite around them is common, and shown inside the second example for this reason.

I think O'Reilly should provide a solution that works for their audience (primarily system administrators, judging from the books other contents) if they have a chapter on the topic. Preferably in Bash, as that's what the book is focused on.

Mentioning specific parsing tools built into Python would also also be of use (other chapters in the same book do pull out other languages if deemed necessary).

This chapter stands out as it exists in the TOC and has no code at all, when a very simple solution will handle common admin tasks easily.


"The solution works reliably for 99% of cases" is another way of saying the solution is not reliable. It's the 1% cases that make the problem harder, and that's what the author was referring to.

edit: As a pedagogical exercise it's fine, but I wouldn't rely on it in practice.


> "The solution works reliably for 99% of cases" is another way of saying the solution is not reliable.

No, it's a way of saying the solution does not work in extreme edge case scenarios, specifically where there is no single character that in usused in your content and therefore no delimiter available.


Except I think that situation is more likely than you think.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: