Maybe my bug? I wrote most of the original Excel text-based file import code in 1984 (including the SYLK format, which was the main way we migrated files from Multiplan to Excel).
Hi mckoss, awesome to hear one of the main Excel architects is here on HN!
Any interesting stories to share? I think we'd all be interested.
Mind if I ask a question directly related to one of my comments a few days ago? I was lamenting the fact that the ASCII codes 29, 30, 31 (the Group, Record, and Unit Separators) never really became widely implemented, as these were specifically designed to delimit data. I.e., one could easily include commas, newlines/carriage returns, etc. in data cells without clashing. But instead CSV seems to be the most common standard for tabular data.
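To make the contrast concrete, here's a rough sketch (Python, with made-up data) of how US/RS-delimited output sidesteps quoting entirely:

    # Sketch: ASCII unit/record separators instead of CSV (illustrative only).
    US = "\x1f"  # unit separator, between fields
    RS = "\x1e"  # record separator, between records

    rows = [["Widget, large", "4,99", "note with\na newline"],
            ["Widget, small", "1,25", "plain"]]

    # Writing needs no quoting or escaping, because the delimiters
    # cannot appear in ordinary text data.
    blob = RS.join(US.join(fields) for fields in rows)

    # Reading is a pair of splits, with no quoting rules to worry about.
    parsed = [record.split(US) for record in blob.split(RS)]
    assert parsed == rows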
Were these ASCII codes ever considered for tabular files?
You couldn't conveniently type those codes on a keyboard and there was no standard way to display them. You'd need a special editor for such files, eliminating the main advantage of an interchangeable text based file format.
Part of your response is accurate as to 'why' dedicated separators never became mainstream - software compatibility - but frankly 99% of CSV files were/are never meant to be hand edited, simply used as a neutral interchange file. For the latter usage, dedicated separators that don't require escaping or import settings would be a major improvement.
For files not intended to be hand-edited, it doesn't really matter what delimiters you use. It's not that difficult to code up character escaping for CSV.
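For reference, the whole quoting rule fits in a few lines; a rough sketch (the function names are my own, not from any particular library):

    # Sketch of minimal RFC 4180-style CSV quoting.
    def csv_field(value: str) -> str:
        # Quote only when the field contains a delimiter, a quote, or a line break.
        if any(ch in value for ch in ',"\n\r'):
            return '"' + value.replace('"', '""') + '"'
        return value

    def csv_row(fields) -> str:
        return ",".join(csv_field(f) for f in fields)

    print(csv_row(["plain", "has, comma", 'has "quotes"', "multi\nline"]))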
At Yahoo, the typical delimiters for logs and whatnot were ctrl+a, ctrl+b, etc. It was slightly nicer than CSV, but only slightly. It was mostly nicer when manually inspecting files with columns that had embedded commas (that otherwise would have been escaped). The machines don't care, and for any interesting processing you'd often end up with escaping anyway.
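(For anyone unfamiliar: Ctrl-A is byte 0x01, Ctrl-B is 0x02, and so on, so consuming such a line is a single split with no quoting logic. A toy example in Python:)

    # Toy example of a Ctrl-A ("\x01") delimited log line.
    line = "2017-11-07\x01GET /index.html\x01Mozilla/5.0 (has, commas)\x01200"
    fields = line.split("\x01")  # embedded commas need no escaping
    print(fields)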
But csv files can also be manipulated in shell scripts, with the text being piped from one util to the next. Here it becomes important that the separators can be correctly typed and displayed.
The standard way to type them is as control-\, control-], control-^, and control-_, which has been true since the introduction of control keys in the 1960s. It's true that there isn't a standard way to display them across platforms; some use ␜, ␝, ␞, and ␟, while others use ^\, ^], ^^, and ^_ (sometimes in inverse video), and maybe the most common display would be the CP437 glyphs, which render them as ∟, ↔, ▲, and ▼. In the Unicode standard, they are displayed as dotted boxes around the strings "FS", "GS", "RS", and "US".
Emacs can edit files in this format without any extra work (it displays them as ^\, ^_, etc., with a different text color so that you can easily distinguish them from character sequences like "^" followed by "\") but maybe you mean to say that Emacs by itself doesn't understand the hierarchical structure of such a file.
This is easily fixed. You can get Emacs forms-mode for a file with these delimiters as follows:
(setq forms-field-sep "\036")   ; RS separates the fields of a record
(setq forms-multi-line "\037")  ; US stands in for newlines inside a field
(setq forms-read-file-filter 'forms-replace-gs-with-newlines)
(setq forms-write-file-filter 'forms-replace-newlines-with-gs)
(setq forms-file "fsgsusrs.data")
(setq forms-number-of-fields (forms-enumerate '(name aliases wikipedia employer notes)))
(setq forms-format-list
      (list
       "Project for a New American Century conspirators\n\n"
       "\n         Name: " name
       "\n      Aliases: " aliases
       "\n     Employer: " employer
       "\nWikipedia URL: " wikipedia
       "\n\nNotes:\n\n" notes))
(defun forms-replace-newlines-with-gs ()
  ;; Write-file filter: turn the newlines between records back into GS (\035).
  (goto-char (point-min))
  (while (search-forward "\n" nil t)
    (replace-match "\035" nil t)))

(defun forms-replace-gs-with-newlines ()
  ;; Read-file filter: turn GS (\035) record separators into newlines.
  (goto-char (point-min))
  (while (search-forward "\035" nil t)
    (replace-match "\n" nil t)))
It took me 35 minutes to write this "special editor for such files"; thus your argument is invalid.
You might further argue that this would create incompatibilities between different systems and so of course everyone would just use the same data file format. Even today, this seems implausible — JSON, various dialects of CSV (with tabs, commas, doublequote-delimited commas, pipes, and colons being the most common delimiters), SQL dumps, and HTML are all in common usage — and in the context of the 1960s and 1970s it seems even less founded. Remember that there were at least five widely used conventions for how to separate lines in ASCII text files up to the 1980s: \r\n (from teletypes), fixed-width records of 80 bytes (from punched cards), \n (from Unix), \r (from PARC, used in Smalltalk, Oberon, and the Macintosh), and \xfe (Pick, see below). And the PDP-10 used a six-bit variant of ASCII they called SIXBIT, the PDP-11 used ASCII, IBM used EBCDIC, and UNIVAC used FIELDATA.
This is a time when even computers from the same manufacturer couldn't agree on how many bits were in a character, much less how to delimit fields in data files. Thus even my attempt to save your argument is invalid.
Interestingly, there was a popular system that worked this way, with non-printable delimiters to divide up different levels of a hierarchical data structure represented as a string: Pick. But Pick didn't use FS, GS, RS, and US in storage either, and although it did use them in the user interface, it used them backwards. Pick's "items" (usually used like records in a database, but accessed like files in a directory) were divided into "attributes", corresponding to database fields, by the "attribute mark", byte 254, displayed as "^" (or often as a line break) and entered as control-^ (RS); the attributes were divided into "values" by the "value mark", byte 253, displayed as "]" and entered as control-] (GS); and values could be divided into "sub-values" by the "sub-value mark", byte 252, displayed as "\" and entered as control-\ (FS).
Pick also reserved byte 255 to mark the ends of items, like NUL (\0) in C, or ^Z (\x1A) in CP/M or early MS-DOS. It called it the "segment mark", displayed as "_", and it was entered as control-_ (US).
This is one of the great things about HN, and indeed the internet at large. Someone posts a link to a little quirk of our world, and then someone is able to pipe up and say "hey that's me! I made that thing!" Even better when it's something like a software bug.
Yeah but for every superstar post like this there are thousands of equally upvoted yet ignorant comments that play on the sympathies/bias of HN readers for upvotes.
I wish we had more of the former and less of the latter. I suspect no forum is ever safe from eternal September.
Essentially a reputation system, based on publisher or specific known authors then?
How would, say, something like Thompson's "Trusting Trust" (though I suppose that was published in ACM or IEEE), or a Dijkstra or Pike blog post, rate?
Comments from, say, Linus Torvalds on the LKML, or Lennart Poettering on systemd, or Bill Gates' various book recommendations, etc.?
This happens pretty often on many forums. It's like the Birthday Paradox. You can't expect some creator to chime in on some specifically-selected topic, but relatively often some creator will give their input on some random topic. In the same way, one cannot expect two people in a relatively small group to share birthday X, but can expect two people in the same group to share some birthday with high probability.
This is why I like HN. Opening the comments and seeing "oh maybe this is due to something I did back in the 1980s when I wrote most of that, here's some inside info/explanation" about something as significant as Excel's file handling is something that will never grow old for me.
Spreadsheets still suck at leaving my data alone. At least Excel has that ridiculous import facility where you can instruct it to leave certain columns alone.
Google Sheets is surprisingly the worst in this department. It's pretty much impossible to prevent it from converting 5/7 to a date and 0123 to a number (losing the leading 0, of course, and rendering the data invalid). No, ' is not the answer.
This is why HN is cool. You see a Microsoft knowledgebase article documenting an obsolete enterprise software's bug and its workaround, then an anonymous account takes credit for it. This is what I come here for.
Maybe this was written quite a bit later, but can you explain why the following would allow data to be returned in Excel 97?
"If you move your mouse pointer continuously while the data is being returned to Microsoft Excel, the query may not fail. Do not stop moving the mouse until all the data has been returned to Microsoft Excel."
The bug is still there, and my software still has to work around it in my CSV emitter (force redundant quoting so the first byte will be '"' instead). Boo, I say!
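(For anyone who needs the same workaround: the point is just that the emitted file must never begin with the literal bytes "ID". A rough sketch, with an invented helper name:)

    # Sketch of the workaround: always quote the first field of the first row,
    # so the file starts with '"' and Excel's sniffing can't mistake a leading
    # "ID" for a SYLK file.
    def force_safe_first_field(first_field: str) -> str:
        return '"' + first_field.replace('"', '""') + '"'

    header = ["ID", "Name", "Value"]
    header[0] = force_safe_first_field(header[0])
    print(",".join(header))  # "ID",Name,Value -- no longer sniffed as SYLK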
I'd like to know if you (or someone) knows who had that stupid idea of localizing CSV files. (For those who don't know: Excel in Spanish uses semicolons to separate values, because commas are reserved for decimals.) I'd prefer some standard behavior rather than localized, non-standard behavior.
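On the reading side, the least painful defence I know of is to sniff the delimiter rather than assume one. Python's standard library can do it, for example (a sketch; "unknown.csv" is a stand-in filename):

    import csv

    # Sketch: detect whether a "CSV" uses commas or semicolons before parsing.
    with open("unknown.csv", newline="") as f:
        sample = f.read(4096)
        dialect = csv.Sniffer().sniff(sample, delimiters=",;")
        f.seek(0)
        rows = list(csv.reader(f, dialect))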
A tangent here. Imagine if this had been an open-source effort with good version control. You (or someone else) could've gone back to that point in the history and seen all the related changes, possibly with some context and other details about this bug. :)
On G+, Noah Friedman, who was part of the team that worked on the code, has occasionally inquired about the availability of some early Emacs code. Pre-1990, if I recall.
I'm not sure he ever turned that up even as a standalone tarball, let alone from a revision control repository.
I suspect the parsing for CSV and the parsing for SYLK are completely different pieces of code, and the appropriate one gets called once a determination of the file format is made. This would be in an upper level where you wouldn't want to put format-specific behavior. The problem then is that the code determining the format doesn't dig far enough into the file to make sure it's valid before sending it to the parser.
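In pseudocode terms, the suspicion is that the dispatcher looks more like the first function below than the second (purely illustrative; these parsers are stand-ins, not Excel's real code):

    # Stand-in parsers, just for illustration.
    def parse_sylk(data: bytes):
        if not data.startswith(b"ID;"):  # a real SYLK file begins with an ID record
            raise ValueError("not valid SYLK")
        return ("sylk", data)

    def parse_csv(data: bytes):
        return ("csv", data)

    # A sniffer that trusts the first two bytes commits to the SYLK parser...
    def open_naive(data: bytes):
        return parse_sylk(data) if data.startswith(b"ID") else parse_csv(data)

    # ...whereas a more careful one falls back when the parser rejects the file.
    def open_careful(data: bytes):
        if data.startswith(b"ID"):
            try:
                return parse_sylk(data)
            except ValueError:
                pass
        return parse_csv(data)

    print(open_careful(b"ID,Name\n1,Alice\n"))  # parsed as CSV instead of failing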
That's probably how it has been fixed. I cannot test right now, but since the article doesn't apply to newer versions of Office, I guess they are no longer susceptible to that.
Another bug with CSV files is that opening one triggers a full recalculation, even if calculation is set to manual. This is clearly wrong behaviour.