

Divining encodings on the Web - coderdude
http://nikitathespider.com/articles/EncodingDivination.html

======
PaulHoule
This article and the algorithm in it are pretty weak. The trouble is that
encodings are commonly misrepresented in document metadata, and that many
documents are encoded in ways that aren't quite right.

For instance, it's common to find web sites that contain mostly US-ASCII
text with a smattering of ISO-Latin-1 and UTF-8 sequences for characters with
codepoints > 127. This happens because, with many toolchains, text can get
combined in different encodings.
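
A quick way to see why a single-encoding guess fails on that kind of document
(the byte string here is made up for illustration; it contains "é" once as
Latin-1 0xE9 and once as the UTF-8 pair 0xC3 0xA9):

    mixed = b"caf\xe9 and caf\xc3\xa9"

    try:
        mixed.decode("utf-8")        # blows up on the bare 0xE9 byte
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)

    # Decoding as Latin-1 never fails, but it turns the UTF-8 pair
    # into the familiar mojibake "cafÃ©".
    print(mixed.decode("latin-1"))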

------
pak
The web is a wonderful thing, but I don't envy the lot of browser makers.
Every once in a while, when I have to replicate some portion of the HTTP/HTML
stack (cookies, or, as in this article, content-encoding detection), it
becomes apparent what a disaster certain parts of the web have evolved into.

------
simonw
Mark Pilgrim's chardet library is great for dealing with unknown encodings -
it uses statistical analysis to figure out the encoding, based on code
back-ported from Firefox/Mozilla:

<http://chardet.feedparser.org/>
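
A minimal usage sketch (file name made up; detect() returns a dict with the
guessed encoding and a confidence score, and the encoding can be None for
very short or empty input):

    import chardet

    raw = open("page.html", "rb").read()
    guess = chardet.detect(raw)
    # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    text = raw.decode(guess["encoding"] or "latin-1", errors="replace")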

------
mapgrep
If you're going to use the BOM despite it not being in the HTTP spec, why not
ignore the spec entirely for encoding? The spec is pretty brain-dead in that
it says the HTTP header should override explicit declarations within the
document itself.

When I've written this sort of code, I've found it most practically effective
to look at any explicit document declarations (meta tags) first, then the BOM,
then any XML declarations (XHTML was big once upon a time), and only then the
HTTP headers. Far more people have control over their HTML documents than over
their HTTP server configuration, and I think it's insane to trust the server
configuration and ignore the document. (Actually, there was one case in which
I let the headers win over a document declaration: when the BOM and the HTTP
headers agreed, I would ignore meta tags, which are downright wrong more often
than you would think. For some reason there are a lot of pages out there
encoded in UTF-8 but with meta tags declaring Latin-1 [not vice versa, as you
might expect]. This seemed to be correlated with ColdFusion.)
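
For what it's worth, a rough Python sketch of that lookup order (regexes
simplified and the final fallback is arbitrary; a real implementation needs
stricter parsing and a whitelist of encoding names):

    import re

    def guess_encoding(body, http_charset=None):
        # Rough sketch of the order described above, not the article's
        # algorithm: meta tags, then BOM, then XML declaration, then HTTP.
        head = body[:1024]

        # 1. <meta charset=...> or <meta http-equiv="Content-Type" ...>
        m = re.search(rb'charset\s*=\s*["\']?\s*([-\w]+)', head, re.I)
        if m:
            return m.group(1).decode("ascii", "replace")

        # 2. Byte order marks
        for bom, name in ((b"\xef\xbb\xbf", "utf-8"),
                          (b"\xff\xfe", "utf-16-le"),
                          (b"\xfe\xff", "utf-16-be")):
            if head.startswith(bom):
                return name

        # 3. <?xml version="1.0" encoding="..."?>
        m = re.search(rb'<\?xml[^>]*encoding=["\']([-\w]+)', head, re.I)
        if m:
            return m.group(1).decode("ascii", "replace")

        # 4. Finally, the charset from the HTTP Content-Type header
        return http_charset or "latin-1"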

That said, it's impressive that he's thought through different paths for
text/plain and application/xml. The last, most comprehensive function I wrote
to do this assumed HTML or XML, which obviously doesn't cover everything, and
I missed the text/xml wrinkle.

------
rix0r
I wonder why he makes a distinction between the MIME types application/xml and
text/xml.

An XML document of the former type is parsed for the <?xml encoding="..." ?>
tag, but the latter isn't?

I always thought that application/xml and text/xml could be used
interchangeably. Is my assumption wrong or is his process flawed?

~~~
thristian
What you're missing is that text/xml is handled specially (some might say
'broken') by RFC 3023: <http://annevankesteren.nl/2005/03/text-xml>
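
Roughly, the rule boils down to this (a sketch only; the parameter parsing is
simplified and the fallbacks are the RFC defaults as I understand them):

    def xml_charset(content_type, xml_decl_encoding=None):
        # RFC 3023 in miniature: for text/xml the charset parameter (or its
        # US-ASCII default) wins and the XML declaration is ignored; for
        # application/xml the charset parameter wins if present, otherwise
        # the XML declaration applies, otherwise UTF-8.
        mime, _, params = content_type.partition(";")
        charset = None
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.lower() == "charset":
                charset = value.strip().strip('"').lower()

        if mime.strip().lower() == "text/xml":
            return charset or "us-ascii"
        return charset or xml_decl_encoding or "utf-8"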

------
fedd
"the fact that text/xml documents default to US-ASCII surprises a lot of
people (including me), but the HTTP spec is very clear about it."

it's a [s]man's[/s] American world :) :(

~~~
fedd
thanks, upvoters. just think, Americans never ever saw question marks or
squares in their documents or on the web, unlike Russians and others
(<http://en.wikipedia.org/wiki/Mojibake#Russian_and_other_Cyrillic-based_scripts>)

this is sort of deserved, though :)

------
wooster
At web scale, this won't work very well (although it's a good start). For
example, I found this one the last time I was doing encoding detection work:

<http://www.nextthing.org/archives/2007/03/31/microsoft-web>

