
An Unexpected Character Replacement - eaguyhn
https://www.datafix.com.au/BASHing/2019-10-18.html
======
Tarq0n
These angle brackets are how R handles printing some Unicode to stdout on
Windows. In memory it should be regular Unicode, though.

The mistake they made is a classic R footgun: the fileEncoding argument to
write.table() controls the encoding of the filename, not its contents. You
either have to control the encoding of the files by manually creating the
connection through file() or ideally just use the readr or data.table
libraries.
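A minimal sketch of the connection approach (the data frame and filename
here are made up; assuming the goal is a UTF-8 CSV):

```r
# Open the connection yourself so the encoding explicitly applies to
# the text you write through it:
df <- data.frame(name = "M\u00fcller", stringsAsFactors = FALSE)

con <- file("specimens.csv", open = "w", encoding = "UTF-8")
write.csv(df, con, row.names = FALSE)
close(con)

# Or sidestep base R entirely; readr's writers always emit UTF-8:
# readr::write_csv(df, "specimens.csv")
```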

Base R makes a lot of Unixy assumptions when it comes to text, so it's not
pleasant to work with on Windows. The package ecosystem has solved most of
these problems though.

~~~
disgruntledphd2
I'm willing to bet that this problem was actually caused by readr, as base
typically doesn't do crap like this.

Completely agreed with the rest of your points though.

------
avian
I used to have a job that involved parsing large textual datasets. It was
fascinating to me how far you could reconstruct a history of a dataset just by
looking at encoding errors - and practically no dataset I've seen came without
them. Sometimes I could be certain of several specific import/export steps,
each introducing a new layer of encoding errors on top of the previous one.
Other times I could correlate timestamps and see when specific data entry bugs
were introduced and when they were fixed.

Strictly speaking, once you lose the information about the encoding of a
string you can't say anything about it. But given some heuristic, some
contextual knowledge (like how the author of the post guesses that
"M<fc>ller" means "Müller") and a large enough amount of data, you can
pretty much always work back through and correct the errors. Well, as long
as someone didn't replace all 8-bit characters with question marks, but
that was very rare in my experience.
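A sketch of that kind of one-step repair in R, assuming the latin-1 guess
is right (`iconv` re-interprets the raw bytes under whatever source
encoding you name, so the guess is the heuristic part):

```r
# Bytes that were Windows-1252/latin-1 but got passed around unlabelled;
# R prints this as "M\xfcller" or "M<fc>ller" depending on platform:
garbled <- "M\xfcller"

# Re-interpret the raw bytes under the guessed encoding and convert:
fixed <- iconv(garbled, from = "latin1", to = "UTF-8")
# fixed is now "Müller"
```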

------
ChrisSD
> The dataset started out as a Microsoft Excel file, presumably in
> Windows-1252 encoding. This was converted into a CSV, then loaded into the R
> environment for adding additional information required by GBIF, then
> exported from R as a text file with the command option fileEncoding =
> "UTF-8".

Quite a journey. The ultimate culprit was an R text-cleaning function, but
I wonder why the Excel sheet was in Windows-1252 encoding. And can't R
import Excel files directly?

~~~
tialaramex
Much of Microsoft Office for Windows dates to an awkward period when it was
apparent that ASCII wasn't enough, but UTF-8 wasn't yet the obvious winner.

In particular there's a period where you get "strings" that aren't actually
text as we'd understand it but instead a sequence of glyph numbers for a
typeface. So instead of ASCII's "A" or Unicode's U+0041 LATIN CAPITAL LETTER A
you're just encoding that you want whatever the typeface named "Typewriter
Sans" has put in slot number 65. Maybe it's a capital A and maybe it isn't.

This works pretty well on one Windows PC, or even a LAN full of identically
configured Windows PCs. You can see why it was at least superficially
attractive to Windows application programmers. Got a niche that needs
Bulgarian? No problem, the software doesn't care what the glyphs "mean", just
define how to type the codes in and you're done.

But then somebody tries to open the Excel file with the Bulgarian text in
"Steve's Bulgarian" font on a Mac, and it doesn't work; it's gibberish. Oh
dear. You have to install "Steve's Bulgarian" on the Mac, and even then it
only looks better but still isn't quite exactly right.

Today this is obviously completely insane, but by then the problem was
backwards compatibility.

~~~
peterburkimsher
Just some specific dates.

Microsoft Office was released in 1989 [1], and came out on the Mac before it
was ported to Windows.

UTF-8 was designed in 1992, on a placemat [2]. It has been dominant on the web
[3] since 2009, which seems like ages ago now, but is really quite recent.

[1]
[https://en.wikipedia.org/wiki/Microsoft_Office](https://en.wikipedia.org/wiki/Microsoft_Office)

[2]
[https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt](https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt)

[3]
[https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowt...](https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg)

~~~
mkozlows
That's a misleading date for UTF-8 dominance -- as the link notes, it counts
pages as ASCII if they contain only ASCII characters, no matter what
encoding was declared, and ASCII is a subset of UTF-8.

I'd put UTF-8's real prominence with the popularity of XML (where it was the
default encoding). That was specified in 1998, and ubiquitous within a few
years. Nobody at the height of XML would have tried doing a non-UTF design for
anything.

~~~
db48x
XML's character repertoire is almost all of Unicode, but some characters
were disallowed in XML 1.0, and that wasn't fixed until XML 1.1. This still
causes bugs where I work, here in 2019.
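The gap is XML 1.0's Char production: most C0 control characters
(everything below #x20 except tab, LF and CR) may not appear at all, even
as character references. A hedged sketch of a filter in R (the function
names are made up; the ranges follow the spec's Char production):

```r
# TRUE for code points allowed by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
is_xml10_char <- function(cp) {
  cp %in% c(0x9L, 0xAL, 0xDL) |
    (cp >= 0x20 & cp <= 0xD7FF) |
    (cp >= 0xE000 & cp <= 0xFFFD) |
    (cp >= 0x10000 & cp <= 0x10FFFF)
}

# Drop the characters XML 1.0 cannot carry at all:
strip_xml10_invalid <- function(s) {
  cps <- utf8ToInt(enc2utf8(s))
  intToUtf8(cps[is_xml10_char(cps)])
}

strip_xml10_invalid("field\x01with\x02controls")  # "fieldwithcontrols"
```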

------
ngcc_hk
Shouldn't "by R" be in the title?

~~~
kazinator
Where would the spoiler alert go if it's in the title?

