An Unexpected Character Replacement (datafix.com.au)
59 points by eaguyhn on Oct 17, 2019 | 22 comments



These angle brackets are how R handles printing some Unicode to stdout on Windows. In memory it should be regular Unicode, though.
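You can reproduce that escape style yourself: iconv() uses the same "<xx>" notation when you ask it to substitute unconvertible bytes (a minimal sketch):

    x <- "M\xfcller"    # Latin-1 bytes: 0xfc is "ü"
    iconv(x, from = "latin1", to = "ASCII", sub = "byte")
    ## [1] "M<fc>ller"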

The mistake they made is a classic R footgun: the fileEncoding argument to write.table() doesn't do what you'd hope on Windows, because strings get pushed through the native code page on their way out, and anything the code page can't represent gets mangled. You either have to control the encoding yourself by manually creating the connection through file(), or ideally just use the readr or data.table packages.
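For illustration, a rough sketch of the usual workarounds (df is a stand-in data frame; whether the file() route fully avoids the old Windows translation step depends on your R version):

    # Base R: control the output encoding via an explicit connection
    con <- file("out.csv", open = "w", encoding = "UTF-8")
    write.csv(df, con, row.names = FALSE)
    close(con)

    # Or skip base entirely; readr::write_csv() always writes UTF-8
    readr::write_csv(df, "out.csv")
    data.table::fwrite(df, "out.csv")   # another popular route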

Base R makes a lot of Unixy assumptions when it comes to text, so it's not pleasant to work with on Windows. The package ecosystem has solved most of these problems though.


I'm willing to bet that this problem was actually caused by readr, as base typically doesn't do crap like this.

Completely agreed with the rest of your points though.


I used to have a job that involved parsing large textual datasets. It was fascinating to me how far you could reconstruct a history of a dataset just by looking at encoding errors - and practically no dataset I've seen came without them. Sometimes I could be certain of several specific import/export steps, each introducing a new layer of encoding errors on top of the previous one. Other times I could correlate timestamps and see when specific data entry bugs were introduced and when they were fixed.

Strictly speaking, once you lose the information about a string's encoding you can't say anything about it with certainty. But given some heuristics, some contextual knowledge (like how the author of the post guesses that "M<fc>ller" means "Müller"), and a large enough amount of data, you can pretty much always work back through and correct the errors. Well, as long as someone didn't replace all 8-bit characters with question marks, but that was very rare in my experience.
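For the most common case (UTF-8 bytes mis-decoded as Latin-1), the repair in R is just a re-encode plus a re-declare; a minimal sketch:

    mojibake <- "M\u00c3\u00bcller"   # "MÃ¼ller": UTF-8 bytes read as Latin-1
    bytes <- iconv(mojibake, from = "UTF-8", to = "latin1")   # recover the raw bytes
    Encoding(bytes) <- "UTF-8"        # re-declare them as what they really were
    bytes
    ## [1] "Müller"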


> The dataset started out as a Microsoft Excel file, presumably in Windows-1252 encoding. This was converted into a CSV, then loaded into the R environment for adding additional information required by GBIF, then exported from R as a text file with the command option fileEncoding = "UTF-8".

Quite a journey. The ultimate culprit was an R text-cleaning function, but I wonder why the Excel sheet was in Windows-1252 encoding. And can't R import Excel files directly?


Much of Microsoft Office for Windows dates to an awkward period when it was apparent that ASCII wasn't enough, but UTF-8 wasn't yet the obvious winner.

In particular there's a period where you get "strings" that aren't actually text as we'd understand it but instead a sequence of glyph numbers for a typeface. So instead of ASCII's "A" or Unicode's U+0041 LATIN CAPITAL LETTER A you're just encoding that you want whatever the typeface named "Typewriter Sans" has put in slot number 65. Maybe it's a capital A and maybe it isn't.

This works pretty well on one Windows PC, or even a LAN full of identically configured Windows PCs. You can see why it was at least superficially attractive to Windows application programmers. Got a niche that needs Bulgarian? No problem: the software doesn't care what the glyphs "mean", just define how to type the codes in and you're done.

But then somebody tries to open the Excel file with the Bulgarian text in "Steve's Bulgarian" font on a Mac and it doesn't work; it's gibberish. Oh dear. You have to install "Steve's Bulgarian" on the Mac, and then it looks better, although it still isn't quite right.

Today this is obviously completely insane, but by then the problem was backwards compatibility.


There needs to be an "alternateTimelineHackerNews" subreddit where each post picks a historical technology, removes it from history, and then tries to argue convincingly for the status quo because that now-missing technology could not possibly be created.

I'd love to read a thread of apology theater for these "glyph slots" and how something like unicode/utf-8 would never work as a replacement. :)


Well, for one thing you'd have to form some kind of giant committee to decide on what characters actually exist, and assign them all unique numbers. Can you imagine how ridiculous that would be?


Hehe.

It's interesting to think about these:

* unicode -- I think the naysayers would win here because there are way more people today who understand the security and complexity issues surrounding the topic than those who understand the details of why unicode and the various encodings work the way they do

* gnu -- this would be a great challenge to the naysayers. But I think it could be done if the naysayers really studied the hardware of the time and could ask pointed questions that a Stallman impersonator couldn't answer.

* fftw -- "hey, we're gonna build a library that queries at runtime for whatever SIMD instructions are available on the user's computer and chooses the fastest algo." Yeah, right. :)

The possibilities are bounded but many.


Said committee would be mired in the politics of what should and shouldn't be included. Cuneiform? Klingon? Clip art? They are all technically human communication.


The comment you're replying to is joking about the Unicode Technical Committee, which of course exists and really does what was described. So it's sarcasm.

But the Technical Committee doesn't spend very much time worrying about whether it should include Cuneiform (yes), Klingon (no) or Clip Art (mostly no but there are some things you might think of as "clip art" that are in ISO-10646 / Unicode).


Yes, I'm aware of this. My post is also made in jest along the same lines, albeit with a grain of genuine distaste for what is and isn't allowed in Unicode.


So much better to have local control instead of ending up beholden to a committee of busybodies.


Yeah, I think the local control argument would be quite effective against a number of technologies.

In fact, the Linux mailing list comes standard with its own early argument from Tanenbaum against the monolithic design. So no alternate history is necessary there. :)


Just some specific dates.

Microsoft Office was released in 1989 [1], and came out on the Mac before it was ported to Windows.

UTF-8 was designed in 1992, on a placemat [2]. It has been dominant on the web [3] since 2009, which seems like ages ago now, but is really quite recent.

[1] https://en.wikipedia.org/wiki/Microsoft_Office

[2] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

[3] https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowt...


That's a misleading date for UTF-8 dominance -- as the link notes, it counts a page as ASCII if it only contains ASCII characters, no matter what encoding was actually declared, and ASCII is a subset of UTF-8.

I'd put UTF-8's real prominence with the popularity of XML (where it was the default encoding). That was specified in 1998, and ubiquitous within a few years. Nobody at the height of XML would have tried doing a non-UTF design for anything.


XML nearly covers all of Unicode, but some characters were disallowed in XML 1.0 (most of the C0 control characters), and that wasn't fixed until XML 1.1. This still causes bugs where I work, here in 2019.
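XML 1.0 forbids most of the C0 control characters (everything below 0x20 except tab, newline and carriage return), so one defensive habit is scrubbing them before serializing. A rough sketch in R, assuming a character vector x:

    # Strip the control characters XML 1.0 cannot represent at all
    x_clean <- gsub("[\x01-\x08\x0b\x0c\x0e-\x1f]", "", x)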


Microsoft Office was initially just a bundle of separate programs; the constituent applications are even older (though Excel still debuted on the Mac first, in 1985).


My Windows ME with IE 5.5 supports UTF-8.


The whole encoding thing was so annoying. I had experience with this while learning computing in the 90s in Latvia. The common use case was needing both the Latvian and Russian languages, or at least the ability to read them. Neither was convenient.

Russian had its own Cyrillic encodings, but of course there wasn't a single one. Much of the time it was Windows-1251, which you can still run into when browsing the Russian internet even if UTF-8 dominates. But sometimes it would instead be KOI8-R, a different encoding. So you had to know how to convert encodings with special tools, and of course you needed fonts that had the right glyphs in the first place.

As an extra fun note, Windows-1251 uses the byte 0xFF for the letter я, which could break poorly written software that used 0xFF as a sentinel value of some kind.
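You can see it from R in one line (assuming your iconv build knows the name "CP1251"):

    charToRaw(iconv("\u044f", from = "UTF-8", to = "CP1251"))   # я
    ## [1] ff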

And then on top of that, there was dealing with Latvian. It's a small and relatively obscure language, using an extended Latin alphabet, with Latvian letters like ā, ž, ļ and ķ. So that meant another encoding, Windows-1257, and needing the fonts - some Latvian letters like ž or č existed in many fonts due to also existing in some bigger languages, but ā or ļ were less common. Unlike Russian, Latvian documents would at least be understandable if rendered with the wrong encoding, but input was a bigger problem. Windows had no out-of-the-box support until Windows XP. So from Windows 3.1 up until XP, additional third-party software was required to type Latvian, and the software would do things like hooking into the keyboard driver, which was of course not always side-effect free.

There are things that have gotten worse in computing, but the victory of Unicode is a huge step forward. Up until fairly recently, software was very obviously geared towards the 26-letter Latin alphabet, which means English (or Malay).


A pretty good overview of the awkward period with Excel: https://donatstudios.com/CSV-An-Encoding-Nightmare


"By R" should be in the title?


Where would the spoiler alert go if it's in the title?



