
Python-ftfy: Given Unicode text, make its representation possibly less broken - rspeer
https://github.com/LuminosoInsight/python-ftfy
======
rwg
It even successfully un-mangles the shipping label from Ode to a Shipping
Label:
[https://www.facebook.com/cmb/posts/619241744770551:0](https://www.facebook.com/cmb/posts/619241744770551:0)

    
    
        >>> ftfy.fix_text('L&amp;Atilde;&amp;sup3;pez')
        'López'
    

Bravo!

------
unicodedammit
Very common problem in web scraping; for example, a forum site might contain a
mix of MacRoman, Windows codepages, and various European codepages in a single
page (yes, even in 2014!). It seems like a more advanced version of the
UnicodeDammit module (
[http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit) ).

Note: this module, like UnicodeDammit, is very US/English-centric, and is
_practically useless_ for worldwide web scraping. For non-English pages, it is
necessary to statistically estimate the codepage and language of each page
segment, and then normalize each segment to Unicode.

~~~
m_mueller
The way I understand it from their examples, it's rather Latin-script-centric,
no? Could you give an example where it doesn't work with a romanized language?
If not, then I'd hardly call it English-centric and practically useless
worldwide.

~~~
unicodedammit
from __init__.guess_bytes(): "This is not a magic bullet. If the bytes are
coming from some MySQL database with the "character set" set to ISO Elbonian,
this won't figure it out. Perhaps more relevantly, this currently doesn't try
East Asian encodings."

The world is a very large place; there are many codepages in use besides
Latin-1 and "ISO Elbonian". Central European countries all use Latin-2
(Windows-1250) or a Cyrillic codepage (Windows-1251). Since these are all
single-byte codepages, they cannot be detected by try: convert() / except:
try_another_codepage() and must be distinguished statistically (see the sketch
below). LTR/RTL language and Asian encoding detection is even worse.
[https://en.wikipedia.org/wiki/Code_page](https://en.wikipedia.org/wiki/Code_page)
[https://en.wikipedia.org/wiki/Windows_code_pages](https://en.wikipedia.org/wiki/Windows_code_pages)
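
A minimal sketch of the problem (the pangram and codepage choices here are
just illustrative):

    
    
        # Nearly every byte value maps to some character in each single-byte
        # codepage, so decoding "succeeds" under the wrong one and silently
        # produces garbage instead of raising an error.
        text = "Zażółć gęślą jaźń"        # Polish pangram
        data = text.encode("cp1250")      # Central European (Windows-1250)
        
        for codepage in ("cp1250", "cp1251", "latin-1"):
            print(codepage, "->", data.decode(codepage))
        # All three decodes succeed, but only cp1250 restores the original
        # text; the right codepage has to be estimated statistically
        # (e.g. with a detector like chardet), not found by trial decoding.
    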

Another Python Unicode conversion module which is slightly less US/English-centric:
[https://github.com/buriy/python-readability](https://github.com/buriy/python-readability)

~~~
rspeer
I would like to emphasize that "guess_bytes" is not what ftfy is about. It's
there for convenience, and because the command-line version can't work without
it. But the main "fix_text" function isn't about guessing the encodings of
bytes.

Not all text arrives in the form of unmarked bytes. HTTP gives you bytes
marked with an encoding. JSON straight up gives you Unicode. Once you have
that Unicode, you might notice problems with it like mojibake, and ftfy is
designed for fixing that.
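
For example (an invented string showing the usual UTF-8-decoded-as-Latin-1
mojibake; note this is already a str, not bytes):

    
    
        >>> import ftfy
        >>> ftfy.fix_text('SchrÃ¶dingerâ€™s cat')
        'Schrödinger’s cat'
    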

Like you say, encoding detection has to be done statistically. That's a great
goal for another project (man, I wish chardet could do it), but once
statistics get in there, it would be completely impossible to get a false
positive rate as low as ftfy has.

------
kev009
Had a dataset where this was the case... old devs used Windows, and I'm not
sure what the DB encoding was set to when they did imports, etc. I've been
putting off fixing it because it's just a PITA to deal with.

But I built a sanitizer in a couple of hours with this lib, and it seems to
work pretty well.

The only unexpected thing is that it converts the ordinal indicator º to o in
addresses. Luckily there are only a handful I need to fix.
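
The core of such a sanitizer is only a few lines (the address string here is
invented, and real code would loop over the actual DB fields):

    
    
        import ftfy
        
        def sanitize(value):
            # fix_text repairs the mojibake; note that with the default
            # NFKC normalization it also folds the ordinal indicator º
            # down to a plain "o", as described above.
            return ftfy.fix_text(value)
        
        print(sanitize("Rua SÃ£o JoÃ£o, Nº 12"))
        # Rua São João, No 12
    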

~~~
dbfclark
This is an effect of the default Unicode normalization, which is set to NFKC.
This normalization is lossy for things like the ordinal indicator and the
trademark symbol; if you'd like to keep the ordinal indicator unchanged, use
NFC normalization:

    
    
      >>> print(ftfy.fix_text(u'ordinal indicator º to o in addresses.'))
      ordinal indicator o to o in addresses.
    
      >>> print(ftfy.fix_text(u'ordinal indicator º to o in addresses.', normalization='NFC'))
      ordinal indicator º to o in addresses.

------
teddyh
I can’t help but think that this merely gives people the excuse they need for
not understanding the “things-that-are-not-ASCII” problem. Using this library
is a desperate attempt to have a _just-fix-it_ function, but it can never
cover all cases and will inevitably corrupt data. To use this library is to
remain an ASCII neanderthal, ignorant of the modern world and of the
differences between text, bytes, and encodings.

Let me explain in some detail why this library is not a good thing:

In an ideal world, you would _know_ what encoding bytes are in and could
therefore decode them explicitly using the known correct encoding, and this
library would be redundant.

If instead, as is often the case in the real world, the encoding is _unknown_,
there is the question of how to resolve the _numerous ambiguities_ that
result. A library such as this has to _guess_ what encoding to use in each
specific instance, and the choices it ideally should make are _extremely_
dependent on the circumstances and even the immediate context. As it is, the
library is hard-coded with specific algorithms that choose some encodings over
others, and if those assumptions do not match your use case _exactly_, the
library will corrupt your data.

A much better solution would perhaps involve machine learning, with the
library trained to deduce the probable encodings from a large set of example
data from each user’s _individual_ use case. Even that would occasionally be
wrong, but at least it would be the best we could do with unknown encodings
without resorting to manual processing.

However, a _one-size-fits-all_ “solution” such as this merely gives people a
further excuse to keep not caring about encodings, to pretend that encodings
can be “detected”, and to believe that there exists such a thing as “plain
text”.

~~~
rspeer
I think you're criticizing the wrong library. ftfy isn't about encoding
detection. Should I take guess_bytes out of the documentation to stop giving
that impression?

It's the library you use when the data you get has _already_ been decoded
incorrectly. The user of ftfy cares about encodings, but gets data from
sources that don't.

And in no practical sense does it corrupt your data. I don't know where you
got that idea from. It leaves good data alone.

I will not say that false positives are _nonexistent_, but they are
vanishingly rare -- see
[http://ftfy.readthedocs.org/en/latest/#accuracy](http://ftfy.readthedocs.org/en/latest/#accuracy)
-- and they don't occur in "serious" data; they occur when people are screwing
around with bizarre emoticons and stuff.

~~~
fnl
Well, teddyh might have a point here nonetheless: by now, I understand that
ftfy is about fixing mixed-up encodings between UTF-8, Latin-1, CP437,
CP125[12], and MacRoman (only). But by claiming to fix "Unicode" in general as
the first thing on the GitHub page, you might be misleading first-time
visitors. Maybe you should place the "warning" about which encodings your
library handles right at the start? And make it clear that "moji-un-baking" is
the library's central use case, not just an "interesting thing" it can do.
Despite being quite aware of Unicode and string encodings, I had exactly the
same thoughts as teddyh as I read the first few paragraphs ("Oh, now we will
see those encoding illiterates converting all those beautiful bytes in some
highly informative character encoding to all-too-boring ASCII.")

Which leads me to my other concern: why do you use NFKC compatibility
normalization as the default? Given that you are a text mining company, you of
all people should know that you lose valuable information - particularly about
numbers and super- and subscript characters - with this normalization
strategy. Doing NFKC on all kinds of articles, books, patents, etc. could lead
to disastrous results: e.g., NFKC "decomposes" the string 'O\u2082\u00B9' to
'O21' instead of 'O_2^1' - "oxygen, reference 1" (see the example below). In
general, I think NFC is what Python and many other libraries do, while NFKC
should only be used when you know what you are doing (and why you need it).
Maybe it is useful for some strange, geeky tweets, but I would argue that it's
the corner case, not the default.
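
You can see the difference with nothing but the standard library:

    
    
        >>> import unicodedata
        >>> s = 'O\u2082\u00B9'                  # 'O₂¹': oxygen, reference 1
        >>> unicodedata.normalize('NFC', s)      # canonical forms only: unchanged
        'O₂¹'
        >>> unicodedata.normalize('NFKC', s)     # compatibility folding
        'O21'
    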

~~~
rspeer
I wonder if I could change the default to NFC in the next version without
breaking people's expectations. It is a safer default.

When it comes to text analytics, the underlying tagger and stuff won't know
what O21 is any more than it knows what O_2^1 is anyway. And NFKC is useful
for mixed Latin and Japanese text, which I wouldn't entirely dismiss as
strange and geeky. But it's true that the default could be more conservative.
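
For example, NFKC is what folds full-width Latin and half-width katakana back
into their ordinary forms:

    
    
        >>> import unicodedata
        >>> unicodedata.normalize('NFKC', 'Ｐｙｔｈｏｎ ｶﾀｶﾅ')
        'Python カタカナ'
    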

------
meowface
This is a great project because a lot of work has been put into finding
solutions to many different edge cases, and it "just works".

------
sylvinus
The last testimonial in the README gave me a good laugh :)

------
TazeTSchnitzel
This will mangle text discussing mojibake.

------
sdsk8
Dude, I'll buy you a beer if you want, this project is fucking awesome!

------
tdumitrescu
Wow, a recruiting pitch at the top of the README. Someone make Adblock for
GitHub.

~~~
artursapek
I think the least we can give in return to companies that open-source useful
stuff like this is to look the other way when they self-promote a little bit.

