
Ftfy – fix Unicode that's broken in various ways - simonw
https://ftfy.now.sh/
======
cosmie
Ftfy and Unidecode[1] are two of the first libraries I reach for when munging
around with a new dataset. My third favorite is the regex filter [^ -~]. If
you know you’re working with ASCII data, that single regex against any
untrusted data resolves soooo many potential headaches.
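
A minimal sketch of that filter in Python (the helper name is just illustrative):

    import re

    # [^ -~] matches anything outside printable ASCII (space 0x20 through
    # tilde 0x7E), so substituting it away strips control characters and
    # non-ASCII characters alike.
    NON_ASCII = re.compile(r"[^ -~]")

    def ascii_only(value):
        return NON_ASCII.sub("", value)

    print(ascii_only("HONDA\u00a0POW\r"))  # 'HONDAPOW'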

I manage and source data for a lead generation/marketing firm, and dealing
with free text data from all over the place is a nightmare. Even when working
with billion dollar firms, data purchases take the form of CSV files with
limited or no documentation on structure or formatting or encoding, sales-
oriented data dictionaries, and FTP drops. I have a preprocessing script in
Python that strips out lines that can’t be parsed as UTF-8; I then stage the
data into a Redshift cluster and hit it with a Redshift UDF called
kitchen_sink that encapsulates a lot of my text cleanup and validation
heuristics like the one above.
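
The line-stripping step amounts to something like this (filenames made up):

    # Drop any line that can't be parsed as UTF-8 before staging the file.
    with open("vendor_dump.csv", "rb") as src, \
         open("vendor_dump.clean.csv", "w", encoding="utf-8") as dst:
        for raw_line in src:
            try:
                dst.write(raw_line.decode("utf-8"))
            except UnicodeDecodeError:
                continue  # skip lines that can't be decoded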

~~~
jonathan_n
This. Ftfy has saved me many, many times. Thanks for the regex, brilliant.

~~~
cosmie
I don't remember where I came across that regex, but it's saved me from so
many headaches I quite literally get giddy any time I can insert it into a
processing stream.

A developer upstream naively split on \n and left some errant \r characters
everywhere? Fixed.

Embedded backspaces or nulls or tabs or any sort of control character? Gone.

Non-ASCII data in a field you _know_ should be ASCII only? Ha, not today good
sir!

Until you've had to deal with the hell that is raw, free-form data from who-
knows-where, you cannot even fathom how satisfying it is to be able to deploy
that regex (when appropriate) and know beyond a doubt that you can't have any
more gremlins hiding in that particular field/data that'll hit you later.

~~~
thiagocsf
It’s nice not having to deal with languages other than English.

In Portuguese, which I’ve worked with, you develop other tricks, like
replacing à, á, â or ã with a. But, in order to do this, you still need to
find out the encoding used before you can create the “ASCII” equivalent.

Fun trivia: coco means coconut; cocô means poo. So, by replacing ô with o,
you’re guaranteed a chuckle at some stage.
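
Once you do know the encoding, the usual way to build that "ASCII" equivalent is Unicode decomposition: strip the combining marks after NFD normalization. A rough sketch:

    import unicodedata

    def strip_accents(text):
        # Decompose 'ô' into 'o' plus a combining circumflex, then drop the
        # combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_accents("cocô"))  # 'coco', hence the chuckle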

~~~
Piskvorrr
True. But if that specific case deals with, let's say, URLs, then all of the
content should be ASCII - either as a direct character representation, or
encoded, again to ASCII.

Never did I see the parent mention "this is sufficient for humans", or just
"...for English" - even that would be a naïve assumption (see? ;)).

~~~
rspeer
If your definition of "URLs" includes "IRIs" (a term nobody really uses but
which encompasses the idea that you can use Unicode in URLs), then this isn't
a good assumption to make.

I would rather pass around a link to
[https://ja.wikipedia.org/wiki/メインページ](https://ja.wikipedia.org/wiki/メインページ)
than to
[https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3...](https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8)
, and lots of software will support both. And if you want all URLs to be
ASCII, you'll need to convert the first into the second, not just delete the
Japanese characters.
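
The conversion itself is one call in Python's standard library (sketching just the path component here; query strings need their own escaping):

    from urllib.parse import quote, unquote

    iri_path = "/wiki/メインページ"
    uri_path = quote(iri_path, safe="/")   # percent-encode rather than delete
    print(uri_path)       # /wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
    print(unquote(uri_path))  # /wiki/メインページ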

~~~
zlynx
If the URL is "on the wire" it had better be 7-bit ASCII. Actually even more
restrictive than that. Because that's the spec.
[https://tools.ietf.org/html/rfc3986](https://tools.ietf.org/html/rfc3986)

In user interaction with a browser or wherever else it seems that anything
goes.

~~~
rspeer
That's the spec but this is also the spec:
[https://www.ietf.org/rfc/rfc3987.txt](https://www.ietf.org/rfc/rfc3987.txt)

It's true that protocols such as HTTP only use ASCII URIs on the wire. If you
are implementing an HTTP client yourself, you will need to implement percent-
encoding.

Which is different from saying "well, I know URLs should be ASCII on the wire,
so I can safely delete all the non-ASCII characters from a URL." That's not
true.

~~~
Piskvorrr
As soon as you add to or remove from the string, it no longer points to the
same resource; I considered this too obvious to mention. OTOH, for
_validation_, this is useful: "you have a ZWJ character in a URL, that's
unlikely". And yes, I understand that there are protocols that allow you to
pass around the full Unicode or aunt Matilda or whatever - I should have been
more specific.

------
exikyut
- _Submits issue to Chromium requesting it just run webpages through this
before displaying them_ (I've stumbled on old webpages with legitimately
broken encoding within the past 4-5 months)

- _Creates Rube Goldberg machine to fix UTF-8 text copy-pasted through
TigerVNC_ (which I was surprised to discover setting LC_ALL doesn't fix)

--

Fun trivia w/ VNC, because it's cute:

1. Example from site: #╨┐╤Ç╨╨╨╕╨╗╤î╨╜╨╛╨╡╨┐╨╕╤é╨╨╜╨╕╨╡

2. Fixed: #правильноепитание

3. What I get when I copy #2 through VNC: #Ð¿ÑÐ°Ð²Ð¸Ð»ÑÐ½Ð¾ÐµÐ¿Ð¸ÑÐ°Ð½Ð¸Ðµ

4. What happens when ftfy sees #3: #правильноепитание

5. What happens when I copy #1 through VNC:
#â¨ââ¤Ãâ¨ââ¨ââ¨ââ¨ââ¤Ã®â¨ââ¨ââ¨â¡â¨ââ¨ââ¤Ã©â¨ââ¨ââ¨ââ¨â¡

6. What happens when I paste #5 to ftfy: #правильноепитание

This is absolutely awesome.

VNC details: Arch Linux server, Slackware client, both running TigerVNC 1.8.0.

Bonus: what happens when I paste the above into Google Translate: "# Proper
nutrition" (nice!)

Edit: Wow, Arc didn't choke on this. It has good Unicode support. Nice.

~~~
rspeer
But if Chromium ran ftfy on all text, then you wouldn't be able to read the
examples on the page, or in your post :(

(In general, I will claim that the only false positives you're likely to
encounter using ftfy are test cases for ftfy!)

~~~
exikyut
Very good point.

This is an awesome library. My real question is... how on earth does it figure
out the text is "okay", that the demangling process is done?

~~~
rspeer
Each de-mangling step decreases a "cost" metric on the text, based on its
length plus the number of unusual combinations of characters. It never really
decides that text is "okay", but when there's no step it can take that
decreases the cost, it's done.
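
Roughly, as pseudocode (an illustrative sketch, not ftfy's actual code):

    def fix_greedily(text, steps, cost):
        # Keep applying whichever candidate fix lowers the cost; stop as
        # soon as no step decreases it.
        while True:
            best = min((step(text) for step in steps), key=cost)
            if cost(best) >= cost(text):
                return text
            text = best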

This is an imperfect greedy strategy, incidentally. If it takes multiple steps
to fix some text, it's possible that the first step it needs to take is not
the one that decreases the cost as much as possible, that it has to go through
some awful-looking intermediate state so that everything falls into place for
the next step. This is rare, though. I don't think I could come up with an
example.

~~~
exikyut
oooooh. Very interesting.

And I think I understand what you mean.

------
rspeer
I'm happy to see this Web implementation of ftfy! I especially appreciate how
it converts ftfy's fixing steps into example Python code.

Here's an interesting follow-up question for HN: one of the things that makes
ftfy work is the "ftfy.bad_codecs" subpackage. It registers new text codecs in
Python for encodings Python doesn't support. Should I be looking into actually
making this part of Python?

To elaborate: once ftfy detects that text has been decoded in the wrong
encoding, it needs to decode it in the right encoding, but that encoding may
very well be one that's not built into Python. CESU-8 (a brain-damaged way to
layer UTF-8 on top of UTF-16) would be one example. That one, at least, is
gradually going away in the wild (I thank emoji for this).

Other examples are the encodings that I've given names of the form "sloppy-
windows-NNNN", such as "sloppy-windows-1252". This is where you take a Windows
codepage with holes in it, such as the well-known Windows-1252 codepage, and
fill the holes with the useless control characters that are there in Latin-1.
(Why would you do such a thing? Well, because you get an encoding that's
compatible with Windows and that can losslessly round-trip any bytes.)
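
For instance, with ftfy's codecs registered:

    import ftfy.bad_codecs  # registers the sloppy- codecs with Python

    # 0x81 is a hole in Windows-1252, so the strict codec refuses it:
    #   b"\x81".decode("windows-1252")  -> UnicodeDecodeError
    # The sloppy variant falls back to the Latin-1 control character instead,
    # so every byte value round-trips:
    print(ascii(b"\x81".decode("sloppy-windows-1252")))  # '\x81' (U+0081)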

This has become such common practice on the Web that it's actually been
standardized by WHATWG [1].

If a Web page says it's in "latin-1", or "iso-8859-1", or "windows-1252", a
modern Web browser will actually decode it as what I've called "sloppy-
windows-1252". So perhaps this encoding needs a new name, such as "web-
windows-1252" or maybe "whatwg-1252". And similarly for 1251 and all the
others.

But instead of just doing this in the ftfy.bad_codecs subpackage, should I be
submitting a patch to Python itself to add "web-windows-NNNN" encodings,
because Python should be able to decode these now-standardized encodings? Feel
free to bikeshed what the encoding name should be, too.

[1] [https://encoding.spec.whatwg.org/#legacy-single-byte-
encodin...](https://encoding.spec.whatwg.org/#legacy-single-byte-encodings)

~~~
crdoconnor
Isn't it better staying out? It seems like it shares similar properties with
pytz - new codecs need to be added semi-regularly.

What was it Kenneth Reitz said? Something like: the standard library is where
packages go to die.

~~~
rspeer
Yeah, I know the saying.

My observation here is that the number of text encodings is generally
decreasing, due to the fact that UTF-8 is obviously good. I _want_ wacky
encodings to die. But this is just a class of encodings that have existed for
decades and that Python missed. Perhaps on the basis that they were non-
standard nonsense, but now they're standardized.

It could be argued that web-windows-1252 is the third most common encoding in
the world.

If I'm giving directions for how to decode text in this encoding, it currently
only works if you've imported ftfy first, even if you don't need ftfy.
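
Concretely, the directions have to begin with an import whose only job is to register the codec:

    import codecs

    # On a stock Python, the name isn't registered:
    #   codecs.lookup("sloppy-windows-1252")  -> LookupError
    import ftfy.bad_codecs   # importing this is what registers the codec
    codecs.lookup("sloppy-windows-1252")     # now the lookup succeeds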

~~~
simonw
Sounds to me like you've argued yourself around to pitching them for
inclusion! I find the argument that web-windows-1252 is supported by modern
browsers very convincing.

------
peterburkimsher
I'm trying to learn Chinese. I wrote
[http://pingtype.github.io](http://pingtype.github.io) to parse blocks of
text, and I'm now building up a large data set of movie subtitles, song
lyrics, Bible translations, etc.

Try reading this in TextWrangler: 1 . 教會組織: 小會:代議長老郭倍宏

The box causes the following characters to be unreadable - it gets interpreted
as a half-character. Deleting it makes the text show correctly.

I tried it with ftfy, but it just copied the input through to the output.

~~~
rspeer
Interesting. Here's the output of ftfy.explain_unicode on that text:

    
    
        U+0031  1       [Nd] DIGIT ONE
        U+0020          [Zs] SPACE
        U+002E  .       [Po] FULL STOP
        U+0020          [Zs] SPACE
        U+6559  教      [Lo] CJK UNIFIED IDEOGRAPH-6559
        U+6703  會      [Lo] CJK UNIFIED IDEOGRAPH-6703
        U+7D44  組      [Lo] CJK UNIFIED IDEOGRAPH-7D44
        U+7E54  織      [Lo] CJK UNIFIED IDEOGRAPH-7E54
        U+003A  :       [Po] COLON
        U+0020          [Zs] SPACE
        U+F081  \uf081  [Co] <unknown>
        U+5C0F  小      [Lo] CJK UNIFIED IDEOGRAPH-5C0F
        U+6703  會      [Lo] CJK UNIFIED IDEOGRAPH-6703
        U+003A  :       [Po] COLON
        ...
    

The anomalous character is U+F081, a character from the Private Use Area.
TextWrangler is allowed to interpret it as whatever it wants, but I don't know
why that would mess up all the following characters.

Here's my theory. The text probably started out in the GBK encoding (used in
mainland China). GBK has had different versions, which supported slightly
different sets of characters. A number of these characters (decreasing as both
GBK and Unicode updated) have no corresponding character in Unicode, and the
standard thing to do when converting them to Unicode has been to convert them
into Private Use characters.

So that probably happened to this one, which may have started as a rare and
inconsistently-supported character.

Python's implementation of GBK (or GB18030) doesn't know what it is. So maybe
what we need to do is flip through this Chinese technical standard [1], or
maybe an older version of it, and track down which codepoint was historically
mapped to U+F081 and what it is now and hahaha oh god

[1]
[https://archive.org/stream/GB18030-2005/GB%2018030-2005#page...](https://archive.org/stream/GB18030-2005/GB%2018030-2005#page/n0/mode/1up)

------
wolfgang42
A few months ago I built a simple web interface for ftfy so I don't have to
start a Python interpreter whenever I need to decode mangled text:
[https://www.linestarve.com/tools/mojibake/](https://www.linestarve.com/tools/mojibake/)

------
pixelbeat
Nice. This handles the mangled example I discussed at:

[http://www.pixelbeat.org/docs/unicode_utils/](http://www.pixelbeat.org/docs/unicode_utils/)

------
534b44a
Reminds me of the Universal Cyrillic decoder [1]

An old MySQL db dump I have has some values such as: !Ãƒâ€šÃ‚Â¡!HONDA POW

Does anyone here have an idea if/how I can recover the mangled text?

[1] [https://2cyr.com/decode/?lang=en](https://2cyr.com/decode/?lang=en)

~~~
rspeer
In fact, ftfy already figures that text out! Here are the recovery steps that
the website outputs:

    
    
        import ftfy.bad_codecs  # enables sloppy- codecs
        s = '!Ãƒâ€šÃ‚Â¡!HONDA POW'
        s = s.encode('sloppy-windows-1252')
        s = s.decode('utf-8')
        s = s.encode('sloppy-windows-1252')
        s = s.decode('utf-8')
        s = s.encode('latin-1')
        s = s.decode('utf-8')
        print(s)
    

And the decoded text is (for some reason):

    
    
        !¡!HONDA POW

~~~
534b44a
Thank you, I'd also tested that but it seems to simply remove the mangled
string part. Maybe it's impossible to recover it automatically after all :/

~~~
rspeer
No, no. That _is_ the recovered text.

Originally, the text had one non-ASCII character, an upside-down exclamation
point. A series of unfortunate (but typical) things happened to that
character, turning it into 9 characters of nonsense, the 9th of which is
_also_ an upside-down exclamation point.

It looks like ftfy is just removing the first 8 characters, but it's reversing
a sequence of very specific things that happened to the text (which just
happens to be equivalent to removing the first 8 characters).
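
You can watch it happen by re-applying the damage in the forward direction (the same steps as the recipe above, run forwards):

    import ftfy.bad_codecs

    s = "¡"                                               # the one real character
    s = s.encode("utf-8").decode("latin-1")               # 'Â¡'
    s = s.encode("utf-8").decode("sloppy-windows-1252")   # 'Ã‚Â¡'
    s = s.encode("utf-8").decode("sloppy-windows-1252")   # 'Ãƒâ€šÃ‚Â¡'
    print(s, len(s))  # nine characters of mojibake, the last of which is '¡' again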

------
chadrs
This is awesome, it reminds me of when we decided to add Unicode support to
our API, but our code had been connecting to MySQL with a Latin-1 connection.
As long as you read from a Latin-1 connection, it _looked_ like everything was
correct, but what was actually being stored was the UTF-8 bytes being decoded
as a Latin-1 string, and then re-encoded to UTF-8 since the column was UTF-8.
Basically:

string.encode("utf-8").decode("latin-1").encode("utf-8")

although technically what mysql calls latin-1 is actually using Windows-1252
:(
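
To illustrate with a made-up value:

    s = "café"
    # What gets stored when a UTF-8 string travels over a latin-1 connection
    # into a UTF-8 column:
    stored = s.encode("utf-8").decode("latin-1").encode("utf-8")
    # Reading back over the same latin-1 connection undoes the damage, so it
    # *looks* fine:
    assert stored.decode("utf-8").encode("latin-1").decode("utf-8") == s
    # But anything reading the column as the UTF-8 it claims to be sees mojibake:
    print(stored.decode("utf-8"))  # 'cafÃ©'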

~~~
Sharlin
_although technically what mysql calls latin-1 is actually using Windows-1252
:(_

...and what mysql calls UTF-8 is a subset that only supports characters whose
UTF-8 encoding is at most three bytes (i.e. the Basic Multilingual Plane, so
no emoji)! To get real UTF-8 you need to use "utf8mb4". Why anybody uses mysql
is beyond me.

------
teddyh
This could be a useful web service _for interactive use_, since all strings
will be manually verified.

The underlying ftfy library was previously discussed three years ago
([https://news.ycombinator.com/item?id=8187418](https://news.ycombinator.com/item?id=8187418))
and my comments at the time are still relevant:

I can’t help but think that this [library] merely gives people the excuse they
need for not understanding this “Things-that-are-not-ASCII” problem. Using
this library is a desperate attempt to have a just-fix-it function, but it can
never cover all cases, and will inevitably corrupt data. To use this library
is to remain an ASCII neanderthal, ignorant of the modern world and the
difference of text, bytes and encodings.

Let me explain in some detail why this library is not a good thing:

In an ideal world, you would _know_ what encoding bytes are in and could
therefore decode them explicitly using the known correct encoding, and this
library would be redundant.

If instead, as is often the case in the real world, the encoding is _unknown_,
there exists the question of how to resolve the _numerous ambiguities_ which
result. A library such as this would have to _guess_ what encoding to use in
each specific instance, and the choices it ideally should make are _extremely_
dependent on the circumstances and even the immediate context. As it is, the
library is hard-coded with some specific algorithms to choose some encodings
over others, and if those assumptions do not match your use case _exactly_,
the library will corrupt your data.

A much better solution would perhaps involve a machine learning solution to
the problem, and having the library be trained to deduce the probable
encodings from a large set of example data from each user’s _individual_ use
case. Even these will occasionally be wrong, but at least it would be the best
we could do with unknown encodings without resorting to manual processing.

However, a _one-size-fits-all_ “solution” such as this is merely giving people
a further excuse to keep not caring about encodings, to pretend that encodings
can be “detected”, and that there exists such a thing as “plain text”.

[…]

I have […] two main arguments:

1. Due to its simplicity for a large group of naïve users, the library will
likely be prone to over- and misuse. Since the library uses guessing as its
method of decoding, and by definition a guess may be wrong, this will lead to
some unnecessary data corruption in situations where use of this library (and
the resulting data corruption) was not actually needed.

2. The library uses a one-size-fits-all model in the area of guessing
encodings and language. This has historically proven to be less than a good
idea, since different users in different situations use different data and
encodings, and [the] library’s algorithm will not fit all situations equally
well. I [suggest] that a more tunable and customizable approach would indeed
be the best one could do in the cases where the encoding is actually not
known. (This minor complexity in use of the library would also have the
benefit of discouraging overuse in unwarranted situations, thus also resolving
the first point, above.)

~~~
rspeer
It's a little strange for you to be criticizing ftfy as an encoding guesser,
given that ftfy is not an encoding guesser. Are you thinking of chardet?

> In an ideal world, you would know what encoding bytes are in and could
> therefore decode them explicitly using the known correct encoding, and this
> library would be redundant.

Twitter is in a known encoding, UTF-8. Most of ftfy's examples come from
Twitter. ftfy is not redundant.

When ftfy gets the input "#╨┐╤Ç╨╨╨╕╨╗╤î╨╜╨╛╨╡╨┐╨╕╤é╨╨╜╨╕╨╡", it's not because
this tweet was somehow in a different encoding, it's because the bot that
tweeted it literally tweeted "#╨┐╤Ç╨╨╨╕╨╗╤î╨╜╨╛╨╡╨┐╨╕╤é╨╨╜╨╕╨╡", in UTF-8, due
to _its own_ problems. So you decode the text that was tweeted from UTF-8, and
then you start fixing it.

I still think you're thinking of chardet.

> If instead, as is often the case in the real world, the coding is unknown...

...then you will need to detect its encoding somehow. By now ftfy is a library
for Python 3 only. If you try to pass bytes into the ftfy function, the Python
language itself will stop you.

Are you hypothesizing that everyone dealing with unmarked bytes is passing
them through a chain of chardet and ftfy, and blaming ftfy for all the
problems that would result?

Incidentally, I do machine learning. (That's _why_ I had to make ftfy, after
all.) I have tried many machine learning solutions. They do not come close to
ftfy's heuristics, which are designed to have extremely low false positive
rates that are not attainable by ML. If you want one false positive per
billion inputs... you're going to need like a quadrillion inputs, or you're
going to need a lovingly hand-tuned heuristic.

~~~
teddyh
> _ftfy is not an encoding guesser_

If it isn’t an encoding guesser, what does it do that "".decode("encoding")
doesn’t do?

~~~
bulatb
A guesser answers the question: what encoding did they _actually_ use?

FTFY answers the question: What horrifying sequence of encode/decode
transforms could output this sequence of bytes in UTF-8 that, when correctly
decoded as UTF-8, still results in total gibberish?

In other words...

The problem fixed by an encoding guesser:

1. I encode my text with something that's not UTF-8-compatible.

2. I lie to you and say it's UTF-8.

3. You decode it as UTF-8 and get nonsense. What the heck?

4. A guesser tells you what encoding I actually used.

5. You decode it from the guessed encoding and get text.

    
    
      ---- 
    

The problem fixed by FTFY:

1. I encode string S with non-UTF-8 codec C.

2. I lie that it's UTF-8.

3. Someone decodes it as UTF-8. It's full of garbage, but they don't care.

4. They encode that sequence of nonsense symbols, not the original text, as
UTF-8. Let's charitably name this "encoding" C'.

5. They say: Here teddyh, take this nice UTF-8.

6. You decode it as UTF-8. What the heck?

7. Is it ISO-8859? Some version of windows-X? Nope. It's UTF-8 carrying C', a
non-encoding someone's broken algorithm made up on the spot. There's no
decoder that can turn your UTF-8 back into the symbols of S, because the text
you got was already garbage.

8. FTFY figures out what sequence of mismatched encode/decode steps generates
text in C' and does the inverse, giving you back C^-1( C'^-1( C'( C( S )))) =
S.
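
A concrete sketch of that chain in Python, with the everyday UTF-8/windows-1252 mix-up standing in for C and C' (the direction of the first mistake is mirrored here, but the shape is the same):

    original = "naïve"                                         # S
    garbled = original.encode("utf-8").decode("windows-1252")  # wrong codec: 'ï' becomes two junk characters
    delivered = garbled.encode("utf-8")                        # "here, take this nice UTF-8"

    # Step 8: work out the chain of mismatched steps and invert it
    recovered = delivered.decode("utf-8").encode("windows-1252").decode("utf-8")
    assert recovered == original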

------
scoopr
The Mona Lisa example is delightful!

------
tminima
Damn, this is good. I faced a similar issue where a CSV had mixed encodings.
At the time I never looked for a library; I read a few SO answers and created
an ad hoc Python script to make the file encoding uniform. Ftfy would have
made my work simpler.

------
erAck
Nice tool. Just the title is misleading: it's not Unicode that is broken, it's
the encoders/decoders.

------
alanfalcon
Oh man have I got some spreadsheets I’d love to throw this tool at.

------
tzury
For some reason, all examples on that page deal with "sloppy-windows-1252" >
"utf-8".

Is it fair to say it is a sloppy-windows-1252 fixer?

~~~
orangea
Not when I click on them they don't.

------
nukeop
Someone should add this as an automatic preprocessing filter for Slashdot
comments.

------
tzahola
No, Unicode text is not broken. It’s either your program that’s broken, for it
is interpreting ISO-8859-X/Windows-12XX/whatever as UTF-X; or the program that
produced said data.

~~~
rspeer
I wrote the original library. Your statement is true, but in many cases, not
useful.

As quoted from the documentation [1]:

> Of course you're better off if your input is decoded properly and has no
> glitches. But you often don't have any control over your input; it's someone
> else's mistake, but it's your problem now.

[1] [https://github.com/LuminosoInsight/python-
ftfy](https://github.com/LuminosoInsight/python-ftfy)

