Hacker News
Excel: Error when CSV file starts with “I” and “D” (microsoft.com)
286 points by rompic on July 6, 2016 | 136 comments

Maybe my bug? I wrote most of the original Excel text-based file import code in 1984 (including the SYLK format, which was the main way we migrated files from Multiplan to Excel).

Hi mckoss, awesome to hear one of the main Excel architects is here on HN!

Any interesting stories to share? I think we'd all be interested.

Mind if I ask a question directly related to one of my comments a few days ago? I was lamenting the fact that the ASCII codes 29, 30, 31 (the Group, Record, and Unit separators) never really became widely implemented, as these were specifically designed to delimit data. I.e., one could easily include commas, newlines/carriage returns, etc. in data cells without clashing. But instead CSV seems to be the most common standard for tabular data.

Were these ASCII codes ever considered for tabular files?


You couldn't conveniently type those codes on a keyboard and there was no standard way to display them. You'd need a special editor for such files, eliminating the main advantage of an interchangeable text based file format.

Part of your response is accurate as to why dedicated separators never became mainstream - software compatibility - but frankly 99% of CSV files were/are never meant to be hand-edited, simply used as a neutral interchange file. For the latter usage, dedicated separators that don't require escaping or import settings would be a major improvement.

For files not intended to be hand-edited, it doesn't really matter what delimiters you use. It's not that difficult to code up char-escaping for csv.
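For what it's worth, the "char-escaping" in question really is small. A sketch (Python's standard csv module, not Excel's code) of the round-trip:

```python
import csv
import io

# Fields containing commas, quotes, and newlines round-trip fine: the
# writer wraps such fields in double quotes and doubles any embedded
# quotes, per the usual CSV escaping convention (RFC 4180 style).
rows = [["plain", "comma, inside", 'quote " inside', "new\nline"]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)

# Reading it back recovers the original fields exactly.
assert list(csv.reader(io.StringIO(buf.getvalue()))) == rows
```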

At Yahoo, the typical delimiters for logs and whatnot were ctrl+a, ctrl+b, etc. It was slightly nicer than CSV, but only slightly. It was mostly nicer when manually inspecting files with columns that had embedded commas (that otherwise would have been escaped). The machines don't care, and for any interesting processing you'd often end up with escaping anyway.
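And the machine side of a control-character format is tiny. An illustrative sketch (Python) of a table serialized with the ASCII unit/record separators instead of commas and newlines — no escaping needed as long as the payload never contains those two bytes:

```python
# US (0x1F) separates fields, RS (0x1E) separates records -- the ASCII
# control characters designed for exactly this job.
US, RS = "\x1f", "\x1e"

def dump(rows):
    return RS.join(US.join(fields) for fields in rows)

def load(blob):
    return [record.split(US) for record in blob.split(RS)]

# Commas and newlines in the data need no escaping at all.
rows = [["id", "note"], ["1", "commas, and\nnewlines are fine"]]
assert load(dump(rows)) == rows
```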

But csv files can also be manipulated in shell scripts, with the text being piped from one util to the next. Here it becomes important that the separators can be correctly typed and displayed.

The standard way to type them is as control-\, control-], control-^, and control-_, which has been true since the introduction of control keys in the 1960s. It's true that there isn't a standard way to display them across platforms; some use ␜, ␝, ␞, and ␟, while others use ^\, ^], ^^, and ^_ (sometimes in inverse video), and maybe the most common display would be the CP437 glyphs, which render them as ∟, ↔, ▲, and ▼. In the Unicode standard, they are displayed as dotted boxes around the strings "FS", "GS", "RS", and "US".

Emacs can edit files in this format without any extra work (it displays them as ^\, ^_, etc., with a different text color so that you can easily distinguish them from character sequences like "^" followed by "\") but maybe you mean to say that Emacs by itself doesn't understand the hierarchical structure of such a file.

This is easily fixed. You can get Emacs forms-mode for a file with these delimiters as follows:

    (setq forms-field-sep "\036")
    (setq forms-multi-line "\037")
    (setq forms-read-file-filter 'forms-replace-gs-with-newlines)
    (setq forms-write-file-filter 'forms-replace-newlines-with-gs)

    (setq forms-file "fsgsusrs.data")
    (setq forms-number-of-fields (forms-enumerate '(name aliases wikipedia employer notes)))
    (setq forms-format-list
          (list "Project for a New American Century conspirators\n\n"
                "\n         Name: " name
                "\n      Aliases: " aliases
                "\n     Employer: " employer
                "\nWikipedia URL: " wikipedia
                "\n\nNotes:\n\n" notes))

    (defun forms-replace-newlines-with-gs ()
      (goto-char (point-min))
      (while (search-forward "\n" nil t)
        (replace-match "\035" nil t)))

    (defun forms-replace-gs-with-newlines ()
      (goto-char (point-min))
      (while (search-forward "\035" nil t)
        (replace-match "\n" nil t)))
It took me 35 minutes to write this "special editor for such files"; thus your argument is invalid.

You might further argue that this would create incompatibilities between different systems and so of course everyone would just use the same data file format. Even today, this seems implausible — JSON, various dialects of CSV (with tabs, commas, doublequote-delimited commas, pipes, and colons being the most common delimiters), SQL dumps, and HTML are all in common usage — and in the context of the 1960s and 1970s it seems even less founded. Remember that there were at least five widely used conventions for how to separate lines in ASCII text files up to the 1980s: \r\n (from teletypes), fixed-width records of 80 bytes (from punched cards), \n (from Unix), \r (from PARC, used in Smalltalk, Oberon, and the Macintosh), and \xfe (Pick, see below). And the PDP-10 used a six-bit variant of ASCII they called SIXBIT, the PDP-11 used ASCII, IBM used EBCDIC, and UNIVAC used FIELDATA.

This is a time when even computers from the same manufacturer couldn't agree on how many bits were in a character, much less how to delimit fields in data files. Thus even my attempt to save your argument is invalid.

Interestingly, there was a popular system that worked this way, with non-printable delimiters to divide up different levels of a hierarchical data structure represented as a string: Pick. But Pick didn't use FS, GS, RS, and US in storage either, and although it did use them in the user interface, it used them backwards. Pick's "items" (usually used like records in a database, but accessed like files in a directory) were divided into "attributes", corresponding to database fields, by the "attribute mark", byte 254, displayed as "^" (or often as a line break) and entered as control-^ (RS); the attributes were divided into "values" by the "value mark", byte 253, displayed as "]" and entered as control-] (GS); and values could be divided into "sub-values" by the "sub-value mark", byte 252, displayed as "\" and entered as control-\ (FS).

Pick also reserved byte 255 to mark the ends of items, like NUL (\0) in C, or ^Z (\x1A) in CP/M or early MS-DOS. It called it the "segment mark", displayed as "_", and it was entered as control-_ (US).

Note that ASCII-1963 had eight separator control characters instead of just four: http://worldpowersystems.com/J/codes/#S0

This is one of the great things about HN, and indeed the internet at large. Someone posts a link to a little quirk of our world, and then someone is able to pipe up and say "hey that's me! I made that thing!" Even better when it's something like a software bug.

Yeah but for every superstar post like this there are thousands of equally upvoted yet ignorant comments that play on the sympathies/bias of HN readers for upvotes.

I wish we had more of the former and less of the latter. I suspect no forum is ever safe from eternal September.

What do you suggest as a means for achieving that end?

Cites from journals (not politicized websites like huffpo, guardian, etc) , credential listings, publication listings, etc.

Something like this could filter out the "random geek wanna-be with an axe to grind" type post.

Essentially a reputation system, based on publisher or specific known authors then?

How would, say, something like Thompson's "Trusting Trust" (though I suppose that was published in ACM or IEEE), or a Dijkstra or Pike blog post, rate?

Comments from, say, Linus Torvalds on the LKML, or Lennart Poettering on systemd, or Bill Gates' various book recommendations, etc.?

Comment proving HN superiority over other pathetic news websites.

This happens pretty often on many forums. It's like the Birthday Paradox. You can't expect some creator to chime in on some specifically-selected topic, but relatively often some creator will give their input on some random topic. In the same way, one cannot expect two people in a relatively small group to share birthday X, but can expect two people to share some birthday in the same group with high probability.

This is why I like HN. Opening the comments and seeing "oh maybe this is due to something I did back in the 1980s when I wrote most of that, here's some inside info/explanation" about something as significant as Excel's file handling is something that will never grow old for me.

Spreadsheets still suck at leaving my data alone. At least Excel has that ridiculous import facility where you can instruct it to leave certain columns alone.

Google Sheets is surprisingly the worst in this department. It's pretty much impossible to prevent Google Sheets from converting 5/7 to a date and 0123 to a number (losing the leading 0, of course, and rendering the data invalid). No, ' is not the answer.

This is why HN is cool. You see a Microsoft knowledgebase article documenting an obsolete enterprise software's bug and its workaround, then an anonymous account takes credit for it. This is what I come here for.

Maybe this was written quite a bit later, but can you explain why the following would allow data to be returned in Excel 97?

"If you move your mouse pointer continuously while the data is being returned to Microsoft Excel, the query may not fail. Do not stop moving the mouse until all the data has been returned to Microsoft Excel."


I've always wanted to know why Method 2 works!

Some kind of weird event loop issue! Previous HN discussion on this topic goes fairly deep:


32 years ago! What are the chances your code has survived that long in a product like Excel?

Admittedly this is for Excel 2003, so only ~20 years. I think there's a good chance that this bug was theirs ;)

The bug is still there, and my software still has to work around it in my CSV emitter (force redundant quoting so the first byte will be '"' instead). Boo, I say!
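(Assuming your emitter can force quoting, the workaround can be as blunt as this Python sketch — quote everything, so the file never starts with a bare "ID":)

```python
import csv
import io

# QUOTE_ALL forces quotes around every field, so a table whose first
# column is "ID" begins with '"' on disk and Excel's SYLK sniffing
# never triggers.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(["ID", "Name"])
assert buf.getvalue().startswith('"ID"')
```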

Kudos for standing up and saying "Hey, may be my fault"

I'd like to know if you (or someone) know who had the stupid idea of localizing CSV files. (For those who don't know, Excel in Spanish uses semicolons to separate values, because commas are reserved for decimals.) I'd prefer some standard behavior rather than localized, non-standard behavior.

A tangent here. Imagine if this had been an open source effort with good version control. You (or someone else) could've gone to that time and seen all the related changes, possibly some context and other aspects about this bug. :)

For 1986, that's fairly unlikely.

At G+, Noah Friedman, who's part of the team who worked on the code, has inquired occasionally about the availability of some early Emacs code. Pre-1990, if I recall.

I'm not sure he ever turned that up even as a standalone tarball, let alone from a revision control repository.

There is BSD revision control history going back to the early 1980s, e.g. http://svnweb.freebsd.org/csrg/sys/sys/errno.h?view=log

Impressive, thanks.

Ha.. this one is from a land before time..

Was it implemented before the generic CSV import was implemented?

It seems like if parsing fails it should throw an exception and fallback to regular csv parsing.

Am I missing something?

I suspect the parsing for CSV and the parsing for SYLK are completely different pieces of code, and the appropriate one gets called once a determination of the file format is made. This would be in an upper level where you wouldn't want to put format-specific behavior. The problem then is that the code determining the format doesn't dig far enough into the file to make sure it's valid before sending it to the parser.
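A sketch of the kind of deeper check being suggested (hypothetical Python, not Excel's actual logic): only hand the file to the SYLK parser if the ID record actually carries SYLK field codes after it.

```python
def looks_like_sylk(first_line):
    # A real SYLK file opens with an ID record such as "ID;PWXL;N;E".
    # A CSV header like "ID,Name,Email" has no ";field" codes after "ID".
    if not first_line.startswith("ID"):
        return False
    rest = first_line[2:]
    # Require at least one ";..." field code after the ID marker.
    return rest.startswith(";") and len(rest) > 1

assert looks_like_sylk("ID;PWXL;N;E")
assert not looks_like_sylk("ID,Name,Email")
```

Even this is only a heuristic — as noted elsewhere in the thread, a semicolon-delimited locale "CSV" whose header starts with "ID;P" would still pass it.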

That's probably how it has been fixed. I cannot test right now, but since the article doesn't apply to newer versions of Office, I guess they are no longer susceptible to that.

This is so funny, I was literally fixing this bug last night. Still applies to office 2013

Another bug with CSV files is that opening a CSV file triggers a full recalculation, even if calculation is set to manual. This is clearly wrong behaviour.

Nope - just no longer admitting to it. Had to redo a bunch of exports because of this recently.

I've run into this error many times for a decade, always when I make a CSV file of a table whose first column is "ID".

It seems to me that Microsoft could have improved its tests by now. If the first two letters are "ID" but "there are no valid SYLK codes after the 'ID' characters," then maybe it was never meant to be a SYLK file. If the file's suffix is .csv, then maybe you should just treat it as a CSV file. If the file's suffix is .txt, then maybe you should treat it as a text file.

I've run into this bug too and it was mystifying and (iirc) not overridden by manually loading the file using the Text Import wizard. But this...

"If the file's suffix is .csv, then maybe you should just treat it as a CSV file"

doesn't make me any happier, because there are a lot of slight variations on CSV files or a lot of different ways you might want to load a CSV file (perhaps you'd like to specify the character set, or set columns to be text or date format, or not have leading zeroes stripped or numbers in brackets turned into negative values), and if I absentmindedly name them .csv then again I can't use the full loader, because Excel knows best.

And if you want to read a CSV file with a VB macro it's even worse, because you can specify all the parameters but it just silently overrides them all with the CSV defaults just because the file extension is CSV. Hours of debugging spent on that...

Yes, I always use .txt when exporting/importing CSV from/to Excel. Fun!

If you're a fan of this sort of thing, try the following in CMD. (It helps if you read it in the voice of a 90's rapping cartoon cat.)

echo MZ is my name my name is MZ. I'm the hippest loader bug from the sea to the sea! >foo.txt

foo.txt


What does this do? I don't have access to a windows machine.

For historical reasons, the MZ is interpreted as the header of a 16-bit executable. When you run something under CMD (hopefully I'm getting this right, it's been a few years) it first runs it using the CreateProcess API, which just tries to run the file as an executable, regardless of extension, then it falls back to ShellExecute if it doesn't have an MZ header, which dispatches based on extension.

Basically you get a dialog saying the text file isn't a compatible executable. If you change the MZ to anything else, it opens in notepad.

Well, you are trying to execute the text file (with ShellExecute as a fallback). Windows does what you ordered it to.

Windows 7 64 bit here - it just has the error message box, and the same text in the cmd window, it doesn't open the file though :)

"This version of foo.txt is not compatible with the version of Windows you're running. Check your computer's system information to see whether you need a x86 (32-bit) or x64 (64-bit) version of the program, and then contact the software publisher."

That's because it's treating it as a 16-bit DOS executable (since it has the 'MZ' magic number but not the rest of the PE header that tells Windows that it's 32 or 64-bit). 64-bit Windows can't run 16-bit executables at all.

32-bit Windows should fail a little later in the execution process; it can run 16-bit software, but your text file is missing the rest of the MZ-format header.

Alrighty, then, on Linux, just for you.

    [fred@dejah launch]$ chmod +x foo.txt
    [fred@dejah launch]$ ./foo.txt
    fixme:winediag:start_process Wine Staging 1.9.12 is a testing version containing experimental patches.
    fixme:winediag:start_process Please mention your exact version when filing bug reports on winehq.org.
    winevdm: Cannot start DOS application E:\launch\foo.txt because the DOS memory range is unavailable. You should install DOSBox.
    [fred@dejah launch]$

I have Wine installed. Same defect as Windows, which is either good or bad -- this is hard to determine.

At least, the hint of installing DOSBox is reasonable. So, let's do that:

[fred@dejah launch]$ sudo dnf install dosbox

and try it again:

    [fred@dejah launch]$ ./foo.txt
    fixme:winediag:start_process Wine Staging 1.9.12 is a testing version containing experimental patches.
    fixme:winediag:start_process Please mention your exact version when filing bug reports on winehq.org.
    DOSBox version 0.74
    Copyright 2002-2010 DOSBox Team, published under GNU GPL.
    ---
    ALSA lib pulse.c:243:(pulse_connect) PulseAudio: Unable to connect: Connection refused

    CONFIG: Generating default configuration. Writing it to /home/fred/.dosbox/dosbox-0.74.conf
    CONFIG: Loading primary settings from config file /home/fred/.dosbox/dosbox-0.74.conf
    CONFIG: Loading additional settings from config file /home/fred/.wine/dosdevices/c:/users/fred/Temp/cfgcbaf.tmp
    MIXER: Can't open audio: No available audio device, running in nosound mode.
    ALSA: Can't subscribe to MIDI port (65:0) nor (17:0)
    MIDI: Opened device:none
    [fred@dejah launch]$

Doesn't work on 64 bit, but I seem to recall something about the first 2 bytes of a PE header. :)

I have absolutely hit this bug before, multiple times. However, I never put it together that the 'ID' was causing it.

I can think of things breaking for each of those.

Reminds me of a charset detection bug in some versions of notepad [1]:

1. Open new file

2. Type "Bush hid the facts"

3. Save the file

4. Open the file

5. The content of the file has changed to "畂桳栠摩琠敨映捡獴"

[1] https://en.wikipedia.org/wiki/Bush_hid_the_facts

There was a similar sentence in Brazilian Portuguese. Something about our largest TV network (Globo) lying about the existence of Acre (a state people jokingly say doesn't exist, like North Dakota in the US). I can't remember the exact sentence now though.


acre vai pra globo?

Oooh, that was it. I had a completely wrong recollection of it, so my previous comment is BS now :/

I've seen one chardet issue in some Python library where detection fails and causes character corruption if there's only one type of Unicode character present and everything else is ASCII. I pity anyone getting stuck on it.

So this would be another problem that could be avoided if "CSV" would actually friggin mean "comma separated values" and not "whatever Windows might consider comma-equivalent depending on locale and/or phase of the moon"?

No. From the detailed description:

> A SYLK file is a text file that begins with "ID" or "ID_xxxx", where xxxx is a text string. The first record of a SYLK file is the ID_Number record. When Excel identifies this text at the beginning of a text file, it interprets the file as a SYLK file. Excel tries to convert the file from the SYLK format, but cannot do so because there are no valid SYLK codes after the "ID" characters. Because Excel cannot convert the file, you receive the error message.

This is usually called "magic string" or "magic number" (https://en.m.wikipedia.org/wiki/Magic_number_(programming)). It has nothing to do with the comma, and everything to do with SYLK using a pretty risky magic string (ID) and Excel not having a "try SYLK, if that fails, try as csv".

tl;dr: this is about guessing the input format and has nothing to do with the delimiter.

It's not just ID; there are also mandatory fields after that, separated by a semicolon. Thus 'ID;P' would be a valid header for SYLK and e.g. for German "CSV"s. If CSV would stick to commas (and boohoo, just escape more fields when this causes locale issues), then even such overly simple heuristics would've sufficed, wouldn't they?

Although you'd just run into other pathetic cases with such an informal format (CSV, not SYLK). But hey, I need to interchange formats with some IT guys and don't have a big IT department (ready) and they don't want to parse XLS(X), so I just use CSV. Now you have two problems.

"ID;P" is valid at the start of CSV as well. Fields are allowed to contain semicolons.

Depending on locale.

I believe in the Canadian French locale (and maybe many other locales), ";" is considered a separator (a "," equivalent).

Sure, but mhd was saying that if we stuck to CSV being actually comma-separated, "ID;P" would distinguish CSV from SYLK. I'm saying that it wouldn't.

mhd doesn't say that:

Thus 'ID;P' would be a valid header for SYLK and e.g. for German "CSV"s.

"If CSV would stick to commas ... then even such overly simple heuristics would've sufficed, wouldn't it?"

Ah, thanks. Apparently, mhd knew and forgot :-)

The problem of discriminating between csv and SYLK files is unsolvable, as every file with only ASCII characters and no commas or quotes is a valid csv file (with one column) and there are valid SYLK files in that set.

In the general sense, yeah, but you would think it shouldn't be too hard to make a reasonable guess about which one it is and handle failures gracefully instead of assuming that it's SYLK based on a tiny amount of data and then blowing up when it turns out not to be.

Valid, but not nearly as likely to run into by accident.

What's a comma if not a Unicode character that's comma-equivalent in the current locale?

I suspect you're thinking of a comma as ASCII 44. If your source text isn't ASCII then it's just not this simple.

I expect what the parent is truly imagining (but hasn't really thought through the usability implications of) is a format where ASCII newline + ASCII comma are used as delimiters, and then everything else is encoded further so as to basically become opaque octet-strings (whose encoding is such that it will never include '\n' or ',').

In other words, if you take CSV and make it completely useless for hand-editing in a text-editor by taking it 90% of the way to being a length-prefixed binary encoding, you'll have what the programmer's mind intuitively jumps to as "what CSV should be."

No, what OP is talking about is the fact that Excel formats and parses CSV files differently depending on locale settings. For example, here in mainland Europe, Excel thinks CSV files should look like this (semicolons as delimiters, commas as decimal marks):

    id;name;price
    1;widget;3,14

And in fact, to open an "American" CSV file in Excel you either have to set Windows to American locale settings, or use the text import wizard or the "text to columns" feature. If you don't use the correct settings, Excel will not complain, but corrupt the data (sometimes subtly).

I hope you agree that it is INSANE to make a file format depend on locale settings. As someone who writes software for scientists in Europe it leads to endless confusion and annoyance.

(And even if you do everything correctly, Excel will still mangle certain CSV files, interpreting phone numbers as dates etc...)

Oh wow, insane is the right word for that.

(Once long ago in my youth, I got bitten when the quoting app I was writing worked fine on our test server, but got the day and month fields swapped on our customer's server. And that was when I learned that you shouldn't pass dates through ODBC as text strings...

SQL Server was parsing dates based on the Windows date settings. Our test server was set to the default U.S. date format but the customer's server was set to DD/MM/YY, or vice versa, I can't remember now. What made it worse was that we handed the system over early in a month, and it worked fine for over a week...)

We got a bug report of this exact issue yesterday, and it literally stunned me. Who on earth thought that was a good idea?

Actually newlines and commas are OK, and in fact arbitrary bytes can be put into CSV, because officially[1] CSV supports escaping using quotes (and escaping quotes by doubling them).

[1] https://www.ietf.org/rfc/rfc4180.txt

An alternative would have been to not label it "CSV",

Another one would have been to write numbers in a CSV file in the American format and just adapt them to the current locale when reading them.

Both would be a more sensible interpretation of your GP.

No more sensible than starting with the Japanese format and forcing Americans to adapt it to their locale when reading...

That'd make the world a better place, wouldn't it?

Why not just use ASCII 31/30 instead of comma/newline? Just give it a new name, like ASV (ASCII-separated values). What would it take to give glyphs to those characters? Just add them to some font?

Pretty sure any locale-aware OS treats the comma differently depending on what locale is in use, no? Anyway, commas are historically used as separators in numbers in some countries, and in text, etc., so if you really want to shed your blame all over something, maybe it shouldn't be Windows but rather the people who came up with the comma as a record separator in the first place. It was probably OK for them at that time, in that place (the US, I'm guessing), and for the usage needed back then, but I really wish they'd chosen a different separator.

Which "works", in the same way that national character sets "work". Until you start exchanging data across locales ("duh, easy: just don't talk to the Americans, solved!").

Jeez, how many SYLK files are out there? I venture to guess there are orders of magnitude more attempts at opening CSV files with ID as the first header than attempted opens of unconverted SYLK files in Excel.

When I admire Microsoft, it's most often because despite nearly 4 decades of bloat to support, they still can ship reliable software to millions.

I kind of wish I could just use a version from a decade or two ago. The (four-year-old) versions of Word and Excel that the uni site I work for uses are so clunky to navigate. I loved using Word in pre-OS X days - I wonder how much of that is rose-tinted nostalgia.

Try LibreOffice (Portable if necessary) - it has remained with the conservative UI.

"A SYLK file is a text file that begins with "ID" or "ID_xxxx", where xxxx is a text string. The first record of a SYLK file is the ID_Number record. When Excel identifies this text at the beginning of a text file, it interprets the file as a SYLK file. Excel tries to convert the file from the SYLK format, but cannot do so because there are no valid SYLK codes after the "ID" characters. Because Excel cannot convert the file, you receive the error message. "

In other words, Excel has all the information to decide that this is not a SYLK file but a CSV, but just throws an error, because fuck you. Great UX.

That's way up there with the "you can't drag stuff here, should've dragged it a few px further" from Win98.

Who's to say that it should fall back to CSV instead of giving an error? What if it was actually a malformed SYLK file where further headers were mangled in transmission?

I do think it should be giving back a more descriptive error though, possibly one informing the user that it thinks it is a SYLK file and giving them the option to interpret it differently.

That is the Excel team's choice, of course. However, I would have thought it would be based on the usefulness of the given import type today, not in 1988 (I trust there have been new import filters since).

How many SYLK files did you see recently, e.g. within the last 30 years? For me, the result is 0 (zero). As compared to innumerable swarms of CSVs of all flavors. Yet even MSO2013 prefers Yonder Hiftorical Curioufity; whence Excel's fondness and preference for obscure and rare formats, I have no idea.

And yes, giving the user at least an intelligible error message would be nice (I hold no illusions that popping up a format-selection dialog would be a simple feat: import logic is usually a gnarly place).

I feel like this is one of those bugs that would have taken less time to fix than it took to document.

You could even do something as trivial as "if Excel fails to open the file as SYLK, try again as CSV" and cut at least 99% of the problem away.

I feel like this is one of those bugs that would have taken less time to fix than it took to document.

After many years, I still have to slap myself when I catch myself thinking this way. Unless you know the architecture of the software, I've learned that it's often significantly harder than you imagine to fix seemingly simple and obvious bugs without breaking something else.

Fine, a slight correction then. It should be easy to fix except for terrible software architecture. Terrible software architecture is not a valid excuse when you have plenty of time and billions of dollars. So there should be little forgiveness for this bug still existing.

As a consumer affected by this bug, sure, you may find it inexcusable. But the basic truth about any mature software product used by tens of millions is that there will be a laundry list of bugs competing for limited resources. So there will always be bugs that you want to fix but simply isn't a high enough priority relative to all the other bugs or features.

There are limited resources, but there are enough resources to improve the architecture over more than a decade. And if they decided to avoid architecture changes, there would be nothing in the way of throwing lots of siloed software engineers at bugs to get more of them fixed. They're not lacking for money with so many customers. It's possible for them to set a much more inclusive cutoff.

They could have fewer bugs per feature than almost any other product after being stable and popular for so long. But they choose not to make that a goal.

Particularly since it's not as if Microsoft has been busy adding new features to Excel since Office 2007.

It is harder, more expensive and time-consuming to fix bugs later than sooner.

This is basically what happens, at least on my Mac - it complains about it not being a valid SYLK file, then I think, "I have no idea what that means", then I dismiss the error and it opens it up as a CSV.

At least now I know why!

The lesson here is when choosing your new file format's magic bytes, don't use your initials or something clever. Have the decency to use a few actual random bytes as your format's identifier.

For plain text files, it's a little bit different, but there's no excuse for this problem happening in a binary file format.

Switching topics, I'm imagining a solution to this problem where you don't actually know which format it is: you have a streaming processor for each thing it could be, and feed the input to each processor one character at a time. Whenever a processor returns an error, drop it from the list. If the last processor in the list returns an error, report that error to the user along with which format that processor was for. If multiple processors complete successfully, you'll have to either rank them or ask the user what format it is.
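A toy version of that scheme (a Python sketch; the "parsers" here are deliberately trivial stand-ins for real streaming format parsers):

```python
def detect(data, parsers):
    """Feed input one character at a time to every candidate parser;
    drop a candidate as soon as it rejects, and return the survivors."""
    alive = dict(parsers)  # name -> per-character feed function
    for ch in data:
        for name, feed in list(alive.items()):
            try:
                feed(ch)
            except ValueError:
                del alive[name]
        if not alive:
            break
    return sorted(alive)

# Two toy format checkers standing in for real streaming parsers.
def digits_only(ch):
    if not ch.isdigit():
        raise ValueError

def letters_only(ch):
    if not ch.isalpha():
        raise ValueError

candidates = [("digits", digits_only), ("letters", letters_only)]
assert detect("1234", candidates) == ["digits"]
assert detect("12a4", candidates) == []
```

Reporting "the error from the format whose parser failed last" would additionally mean remembering which candidate died last and what its error was.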

SYLK isn't binary, it's all ASCII.

Interesting idea on passing the file to multiple parsers simultaneously, although I don't see a benefit over trying them serially.

I had to check this is not the 1st of April

I particularly like the step-by-step explanation of how to add an apostrophe at the beginning of the first line, with a description of which keyboard key to press...

I have three apostrophes in my (Latin America) keyboard: ` ' ´ but IIRC Excel only recognizes one of them.

I had to fake a csv file for testing yesterday, and started out with id, then went "ahahah", and used name.

I know that's boring but this is one of those generational pieces of knowledge like "keep your docs up to date" that we need to build into software training somehow. (Or rather, this bug is not important, but the kind of training that imparts this knowledge will be vital in building a real software profession)

But yeah, it's a real WTF

Embedded knowledge like this, is why guys like us can never retire...

Excel is useless with CSV and UTF-8. When I need to export to CSV, which I regularly do, I always use LibreOffice.

Excel is quite often useless with CSV, even if its contents are pure ASCII. Whether or not Excel will properly open a standard CSV file delimited with commas (as per RFC 4180) also depends on the locale of the computer running Excel. So if you run Excel on Windows with a Dutch locale, it will by default fail to open the CSV normally and jam all its contents into one cell, because it expects a semicolon as delimiter instead of a comma. To open standard CSV you have to import the file instead.

Someone at some point decided that the Dutch use semi-colons instead of commas to delimit tabulated data, and decided to apply this logic to CSV files as well. I would like to know the history behind such a decision! Switch your locale to English US, and it opens normally.

LibreOffice just opens the CSV and asks the user to confirm the delimiter and what not regardless of locale.

Yes, this is so horrible. As a workaround I sometimes put "sep=," on the first line of the spreadsheet. This forces Excel to use the comma as a separator, regardless of the locale. The downside of course is that every other tool on the planet shows the extra row, so it's not very useful.
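The trick amounts to prepending one extra line before the CSV header, e.g. (Python sketch):

```python
import csv
import io

rows = [["ID", "Name"], ["1", "Ada"]]
buf = io.StringIO()
buf.write("sep=,\r\n")  # Excel-only hint; every other tool sees it as a data row
csv.writer(buf, lineterminator="\r\n").writerows(rows)
print(buf.getvalue())
```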

I suffer from the reverse problem. I need to ingest CSV files produced by excel in various locales. Or is there some generally available alternative to excel that would produce valid RFC CSVs?

There is no such thing as a valid CSV - even the spec is borked, and nobody follows it anyway. If you have a choice, XLSX is a lesser evil.

RFC 4180¹ has been in existence since 2005 and seems perfectly straight forward. The basic syntax is two pages of very understandable text. I hold that to be the definition of valid CSV.

CSV libraries tend to adhere to it (and often support additional options encountered in the wild as well); e.g., Apache Commons CSV².

1: https://tools.ietf.org/html/rfc4180

2: https://commons.apache.org/proper/commons-csv/
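For what it's worth, Python's standard csv module produces RFC 4180-style output by default: quoting kicks in exactly when a field contains a delimiter, a quote, or a line break, and embedded quotes are doubled.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\r\n")  # RFC 4180 specifies CRLF line endings
writer.writerow(["ID", "Comment"])
writer.writerow([1, 'has, a comma and a "quote"'])
print(buf.getvalue())
# ID,Comment
# 1,"has, a comma and a ""quote"""
```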

Seems is the correct word here. Alas, it's one of multiple attempts at standardization ("I hold that to be the definition": the good thing about standards is there are so many to choose from, as this RFC itself acknowledges), it ignores charset altogether (a flaw of CSV, not of the standard), and anything touched by MS Office is unlikely to fit it. So, while useful for emitting CSV, it's no use at all for processing.

Is there a way to report a bug to excel asking for proper RFC 4180 support?

Probably, but what will it achieve realistically? This (both what OP reports and this locale-dependent behaviour) is likely a WONTFIX type of bug, as there are workarounds (don't double-click to open normal CSV if you are in a locale where different delimiters are used, use import; don't start your CSV files with "ID"; etc.).

Importing csv is very messy in excel.

A VBA import routine will always run in the US locale. However, CSV files containing dates formatted according to other locales will have their days and months silently switched whenever the day is not greater than 12.
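The ambiguity is easy to reproduce outside Excel as well: under ordinary strptime semantics, 05/06/2016 parses successfully to two different dates, while a day above 12 at least fails loudly instead of swapping silently.

```python
from datetime import datetime

s = "05/06/2016"
us = datetime.strptime(s, "%m/%d/%Y").date()  # US reading: 2016-05-06
eu = datetime.strptime(s, "%d/%m/%Y").date()  # European reading: 2016-06-05
print(us, eu)  # both parses succeed, with different dates and no warning

try:
    datetime.strptime("13/06/2016", "%m/%d/%Y")  # month 13 does not exist
except ValueError:
    print("a day > 12 raises an error instead of silently swapping")
```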

Wait... which excel? 2013 definitely gives me a choice of separators when importing.

> when importing.

Exactly. Unfortunately, double-clicking the file does not trigger that dialogue. You would be surprised how many IT-staffers simply give up at that point („Can I get the raw data?”, „Sure, here is the automatically generated CSV dump.”, „I can't open it…”).

I can't remember the last time double clicking on a CSV file did the right thing in Excel. I always use the clunky import dialog. Even then it often requires a few tries before I get what I want.

For an operation that is done literally millions of times around the world each year, Microsoft seems to pretend that it's some bizarre corner case. I know CSV is a messy format, but you can do better Microsoft. Even a week or two of work on that could make it much much better.

Surprisingly, Excel can import CSV from UTF8 with BOM (at least new versions of Excel). However, be careful, as Ctrl+S will save it in some arbitrary codepage and you'll probably end up with a document like:

    ID,Name
    1,??? ????
    2,Kevin Dub?is

so take care to always save such documents as .xlsx.

(If you convert them to your local codepage and import that - when at all relevant - it will be saved in the local codepage, so less risk of data loss.)
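If you generate CSV for Excel yourself, one common workaround is to write UTF-8 with a BOM up front; Python exposes this as the "utf-8-sig" codec:

```python
import csv
import io

buf = io.BytesIO()
text = io.TextIOWrapper(buf, encoding="utf-8-sig", newline="")
csv.writer(text, lineterminator="\r\n").writerows(
    [["ID", "Name"], [2, "Kevin Dubois"]]
)
text.flush()
data = buf.getvalue()
print(data[:3])  # b'\xef\xbb\xbf' -- the BOM newer Excels use to detect UTF-8
```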

> Surprisingly, Excel can import CSV from UTF8 with BOM (at least new versions of Excel).

A BOM is abnormal and not recommended in UTF-8; it's basically a shitty MS hack. The BOM exists to detect endianness differences between the document and the host system, and endianness has no impact on UTF-8.

And as freak_nl above notes, Excel uses locale-dependent field separators, so in some countries/locales it will try to import "CSV" with semicolon separators.

It actually does have CSV and UTF-8, but finding the correct settings is very hidden. I'm always amazed at how unable Microsoft is to improve the base of Excel and instead just keeps piling UI layers on top to hide the ugly places.

I refuse to open CSVs in Excel, because so often perfectly valid CSV gets butchered by it. It's easier to just use Notepad++.

A nice weekend project would be a converter from CSV to xlsx or SpreadsheetML [0]

[0] https://en.wikipedia.org/wiki/SpreadsheetML

We had a big wtf moment yesterday at work.

Also see https://en.m.wikipedia.org/wiki/SYmbolic_LinK_(SYLK)

"Applies to Microsoft Excel 2002 Standard Edition, Microsoft Excel 2000 Standard Edition, Microsoft Excel 97 Standard Edition, Microsoft Office Excel 2003 "

I still get the error in Excel 2016, so I guess it's not fixed yet.

We get a warning in Excel 2016, and can open the file.

Ok... that workaround is just terrible. Add one single-quote to the first line?!? What about a closing quote? And wouldn't the side effect of this be for the first line to be treated as one value (instead of separate values for each header)?

The workaround needs its own workaround :)

There are times when I think the source code of the Unix "file" command should be a required part of languages’ standard libraries. I have never seen anything do a better job of describing contents accurately.

It’s really silly to see applications have trouble understanding data. Even on macOS, where "file" is installed, the graphical interface sometimes fails to open/preview something that the command-line "file" describes perfectly (i.e. if it’s really text, just show me the text).

For all its magic, even file is fallible. https://news.ycombinator.com/item?id=8171956

I don't know enough about SYLK to suggest a full solution, but couldn't additional checks be performed to confirm it's a SYLK file? Other than all SYLK files starting with ID, are there any other properties that exist in all SYLK files that could be checked against?
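As a rough sketch (based on the commonly documented SYLK layout; the exact record grammar is an assumption here): Excel-generated SYLK opens with an "ID;P..." header record and its last record is a bare "E", neither of which a CSV whose first column happens to be "ID" would satisfy.

```python
def looks_like_sylk(text):
    # Heuristic only: real SYLK starts with an "ID;..." header record and the
    # final record is a bare "E" (end-of-file marker). A CSV whose header row
    # merely begins with "ID" fails both extra checks.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return bool(lines) and lines[0].startswith("ID;") and lines[-1] == "E"


print(looks_like_sylk("ID;PWXL;N;E\nE"))    # True
print(looks_like_sylk("ID,Name\n1,Alice"))  # False
```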

So why didn't SYLK files get their own extension? Having to parse the file to figure out what it is seems so error-prone. But maybe machine learning will solve this too. Isn't that why extensions were invented: to reduce cognitive load?

They did, but many systems that produced them didn't have such a thing as a filename extension. Microsoft Multiplan, for example, ran on Commodore 64, CP/M, TRS-80, etc. (https://en.m.wikipedia.org/wiki/Multiplan)

And even if they did, chances were that transporting the file between machines (no, you couldn't move a floppy disk between machines, even if both machines had a floppy disk. Typical transport involved sending data over a serial line that only guaranteed to transfer 7 bits/byte, another reason why SYLK is ASCII) lost the extension.

Weird, I hit this exact error 2 weeks ago

Ditto, two days ago.

One thing they could do is just create a new extension, like .CSVX, and treat it as a CSV file per the RFC. I wish.

I've been getting this error from CSV I created in Ubuntu. Glad to have learned what this was about.

Could that be worked around by enclosing the field in quotes?

one of those, how did they miss this?

I get this error all the time.

