Ftfy – fix Unicode that's broken in various ways (now.sh)
226 points by simonw on Jan 9, 2018 | 57 comments



Ftfy and Unidecode[1] are two of the first libraries I reach for when munging around with a new dataset. My third favorite is the regex filter [^ -~]. If you know you’re working with ASCII data, that single regex against any untrusted data resolves so many potential headaches.

I manage and source data for a lead generation/marketing firm, and dealing with free-text data from all over the place is a nightmare. Even when working with billion-dollar firms, data purchases take the form of CSV files with limited or no documentation on structure, formatting, or encoding, sales-oriented data dictionaries, and FTP drops. I have a preprocessing script in Python that strips out lines that can’t be parsed as UTF-8, stages the rest into a Redshift cluster, then hits it with a UDF in Redshift called kitchen_sink that encapsulates a lot of my text cleanup and validation heuristics like the above.
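
Roughly, that first filtering pass looks something like the following (a minimal sketch with made-up file names; the real script does more):

    # Drop any line that isn't valid UTF-8 before staging the file.
    with open('vendor_dump.csv', 'rb') as src, \
         open('vendor_dump_utf8.csv', 'w', encoding='utf-8') as dst:
        for raw_line in src:
            try:
                dst.write(raw_line.decode('utf-8'))
            except UnicodeDecodeError:
                pass  # can't be parsed as UTF-8, skip the line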


This. Ftfy has saved my ass many, many times. Thanks for the regex; brilliant.


I don't remember where I came across that regex, but it's saved me from so many headaches I quite literally get giddy any time I can insert it into a processing stream.

A developer upstream naively split on \n and left some errant \r characters everywhere? Fixed.

Embedded backspaces or nulls or tabs or any sort of control character? Gone.

Non-ASCII data in a field you know should be ASCII only? Ha, not today good sir!

Until you've had to deal with the hell that is raw, free form data from who-knows-where, you cannot even fathom how satisfying it is to be able to deploy that regex (when appropriate) and know beyond a doubt that you can't have any more gremlins hiding in that particular field/data that'll hit you later.


It’s nice not having to deal with languages other than English.

In Portuguese, which I’ve worked with, you develop other tricks, like replacing à, á, â or ã with a. But, in order to do this, you still need to find out the encoding used before you can create the “ASCII” equivalent.
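
If the text is already correctly decoded, one way to do that replacement in Python is to decompose the characters and drop the combining marks (a sketch, not a complete solution for every script):

    import unicodedata

    def strip_accents(text):
        # 'ã' decomposes into 'a' plus a combining tilde; keep only the base letters.
        decomposed = unicodedata.normalize('NFKD', text)
        return ''.join(c for c in decomposed if not unicodedata.combining(c))

    print(strip_accents('coração'))  # coracao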

Fun trivia: coco means coconut; cocô means poo. So, by replacing ô with o, you’re guaranteed a chuckle at some stage.


True. But if that specific case is dealing in, let's say, URLs, then all of the content should be ASCII - either as a direct character representation, or encoded, again to ASCII.

Never did I see the parent mention "this is sufficient for humans", or just "...for English" - even that would be a naïve assumption (see? ;)).


If your definition of "URLs" includes "IRIs" (a term nobody really uses but which encompasses the idea that you can use Unicode in URLs), then this isn't a good assumption to make.

I would rather pass around a link to https://ja.wikipedia.org/wiki/メインページ than to https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3... , and lots of software will support both. And if you want all URLs to be ASCII, you'll need to convert the first into the second, not just delete the Japanese characters.
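
The conversion doesn't have to be done by hand, for what it's worth. A sketch using the standard library, which percent-encodes the path but does not handle internationalized hostnames:

    from urllib.parse import quote

    iri = 'https://ja.wikipedia.org/wiki/メインページ'
    # Percent-encode non-ASCII characters as UTF-8 bytes, leaving the URL delimiters alone.
    url = quote(iri, safe=":/?#[]@!$&'()*+,;=%")
    print(url)
    # https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8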


If the URL is "on the wire" it had better be 7-bit ASCII. Actually even more restrictive than that. Because that's the spec. https://tools.ietf.org/html/rfc3986

In user interaction with a browser or wherever else it seems that anything goes.


That's the spec but this is also the spec: https://www.ietf.org/rfc/rfc3987.txt

It's true that protocols such as HTTP only use ASCII URIs on the wire. If you are implementing an HTTP client yourself, you will need to implement percent-encoding.

Which is different from saying "well, I know URLs should be ASCII on the wire, so I can safely delete all the non-ASCII characters from a URL." That's not true.


As soon as you add to or remove from the string, it no longer points to the same resource; I considered this too obvious to mention. OTOH, for _validation_, this is useful: "you have a ZWJ character in a URL, that's unlikely". And yes, I understand that there are protocols that allow you to pass around the full Unicode or aunt Matilda or whatever - I should have been more specific.


The first version is much more readable and less hacky; alas, unless you are positive that all the software you're interfacing with can handle Unicode (never have I ever been in such a glorious situation), fallback to URL-encoding it is.


I was hoping it was a replacement for Unicode... Unicode is so broken that we need something new.


What does that regex do?


Matches anything that's not between the space and tilde in the ASCII code range[1], which is the entire range of printable ASCII characters. It's similar to the [A-Za-z] you see a lot, but expanded to include the space character, numbers, and punctuation. The regex [ -~] lets you match those characters, whereas [^ -~] negates that and matches anything that's not a printable ASCII character (useful for regex replace functions).

If you look at the table at the top of [1], you'll notice all of the characters at the beginning of the ASCII range which are non-printing and therefore invisible. Plus, at the end for some insane reason, the DELETE character. If there's no valid reason for any of these characters to exist in your dataset, nor for higher code point (UTF) characters to exist, then [^ -~] will match them and let you strip them out all in one go.

[1] http://www.asciitable.com/
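
In Python it's a one-liner. Note how destructive it is, which is why it's only appropriate when the field really should be pure ASCII (a quick sketch):

    import re

    non_printable_ascii = re.compile(r'[^ -~]')

    dirty = 'Hello\x00 wörld\x7f\r'
    print(repr(non_printable_ascii.sub('', dirty)))  # 'Hello wrld' -- the ö is simply gone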


The reason for DEL being 0x7f is that it was originally intended as a marker for "this character was deleted and should be skipped over", not as a command to delete the previous/next character. And you can change any ASCII character into 0x7f by ORing it with 0x7f, that is, punching all the holes in punched tape (or a punchcard, but ASCII usually was not used for them, and punching all the holes in a punchcard is not advisable for mechanical reasons).


Love these historical notes that make everything make sense. Thanks for posting!


> The regex [ -~] lets you match those characters, whereas [^ -~] negates that and matches anything that's not a printable ASCII character

Oh dear, I think I finally understand that mysterious phenomenon on some websites where text I write in my native language gets saved and displayed as empty text, but writing in English works. I've come across this a few times over the years, in random forgotten corners of the Internet. It could well be that this regex got written down somewhere as good security practice, and some developers out there copied it into their code without thinking about whether the ASCII restriction is applicable.


It's probably a more mundane issue related to either older versions of HTML (which did not support UTF-8[1]), or issues with an older database that stored things as latin1 or ASCII by default, or issues with an older programming language that doesn't support non-latin1 or non-ASCII characters without deliberately opting in (like Python 2).

That said, I do agree it's not something that should be used without thought. I mentioned that my usage generally involves knowledge that the data set has no valid reason to contain higher code points, and usually involves usage of Unidecode[2] to convert higher code points into ASCII equivalents (which in many cases strips out contextual knowledge, but is sometimes an acceptable trade off for stability and predictability of the data sent downstream).

[1] https://www.w3schools.com/html/html_charset.asp

[2] https://pypi.python.org/pypi/Unidecode
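
The Unidecode-then-strip combination looks roughly like this (the sample string is just an illustration, and the transliteration is approximate):

    import re
    from unidecode import unidecode

    text = 'Łódź café'
    ascii_text = unidecode(text)                    # roughly 'Lodz cafe'
    ascii_text = re.sub(r'[^ -~]', '', ascii_text)  # belt and suspenders
    print(ascii_text)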


Do you have any idea whether it's possible to search for invisible characters using Google?


I'm not sure, but Google Search in general strips punctuation that isn't specifically designated as a search operator[1], and likely strips control/non-printing characters in a similar manner. Testing it[2] with a few URL-encoded control and invisible characters[3], they seem to be ignored. But they do make it into the query parameter, at least until you edit it for a subsequent search, at which point they get converted into spaces/+ symbols in the query.

[1] https://support.google.com/websearch/answer/2466433?hl=en

[2] https://www.google.com/search?hl=en&q="%0B%08+Foo+%7F"

[3] You can see the full string being searched here, it's 7 characters long: https://r12a.github.io/uniview/?charlist=%08%20Foo%20%7F


Thank you very much for this detailed reply! Made my day :)


Note that when using GNU grep (version 2.25) you have to use the -P flag for Perl-compatible regular expressions. E.g.:

    echo "lörem ipsum" | grep -P "[^ -~]" --color

and not

    echo "lörem ipsum" | grep "[^ -~]" --color


- Submits issue to Chromium requesting it just run webpages through this before displaying them (I've stumbled on old webpages with legitimately broken encoding within the past 4-5 months)

- Creates Rube Goldberg machine to fix UTF-8 text copy-pasted through TigerVNC (which I was surprised to discover setting LC_ALL doesn't fix)

--

Fun trivia w/ VNC, because it's cute:

1. Example from site: #╨┐╤Ç╨░╨▓╨╕╨╗╤î╨╜╨╛╨╡╨┐╨╕╤é╨░╨╜╨╕╨╡

2. Fixed: #правильноепитание

3. What I get when I copy #2 through VNC: #правильноепитание

4. What happens when ftfy sees #3: #правильноепитание

5. What happens when I copy #1 through VNC: #правильноепитание

6. What happens when I paste #5 to ftfy: #правильноепитание

This is absolutely awesome.

VNC details: Arch Linux server, Slackware client, both running TigerVNC 1.8.0.

Bonus: what happens when I paste the above into Google Translate: "# Proper nutrition" (nice!)

Edit: Wow, Arc didn't choke on this. It has good Unicode support. Nice.


But if Chromium ran ftfy on all text, then you wouldn't be able to read the examples on the page, or in your post :(

(In general, I will claim that the only false positives you're likely to encounter using ftfy are test cases for ftfy!)


Very good point.

This is an awesome library. My real question is... how on earth does it figure out the text is "okay", that the demangling process is done?


Each de-mangling step decreases a "cost" metric on the text, based on its length plus the number of unusual combinations of characters. It never really decides that text is "okay", but when there's no step it can take that decreases the cost, it's done.

This is an imperfect greedy strategy, incidentally. If it takes multiple steps to fix some text, it's possible that the first step it needs to take is not the one that decreases the cost the most: it may have to go through some awful-looking intermediate state so that everything falls into place for the next step. This is rare, though; I don't think I could come up with an example.
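
The control flow is roughly the following. This is only an illustration, not ftfy's actual code, and candidate_fixes and cost stand in for the real heuristics:

    def fix_greedily(text, candidate_fixes, cost):
        # Keep applying whichever candidate fix lowers the cost metric the most;
        # stop as soon as no candidate lowers it.
        while True:
            best = min((fix(text) for fix in candidate_fixes), key=cost, default=text)
            if cost(best) >= cost(text):
                return text
            text = best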


oooooh. Very interesting.

And I think I understand what you mean.


I'm happy to see this Web implementation of ftfy! I especially appreciate how it converts ftfy's fixing steps into example Python code.

Here's an interesting follow-up question for HN: one of the things that makes ftfy work is the "ftfy.bad_codecs" subpackage. It registers new text codecs in Python for encodings Python doesn't support. Should I be looking into actually making this part of Python?

To elaborate: once ftfy detects that text has been decoded in the wrong encoding, it needs to decode it in the right encoding, but that encoding may very well be one that's not built into Python. CESU-8 (a brain-damaged way to layer UTF-8 on top of UTF-16) would be one example. That one, at least, is gradually going away in the wild (I thank emoji for this).

Other examples are the encodings that I've given names of the form "sloppy-windows-NNNN", such as "sloppy-windows-1252". This is where you take a Windows codepage with holes in it, such as the well-known Windows-1252 codepage, and fill the holes with the useless control characters that are there in Latin-1. (Why would you do such a thing? Well, because you get an encoding that's compatible with Windows and that can losslessly round-trip any bytes.)

This has become such common practice on the Web that it's actually been standardized by WHATWG [1].

If a Web page says it's in "latin-1", or "iso-8859-1", or "windows-1252", a modern Web browser will actually decode it as what I've called "sloppy-windows-1252". So perhaps this encoding needs a new name, such as "web-windows-1252" or maybe "whatwg-1252". And similarly for 1251 and all the others.

But instead of just doing this in the ftfy.bad_codecs subpackage, should I be submitting a patch to Python itself to add "web-windows-NNNN" encodings, because Python should be able to decode these now-standardized encodings? Feel free to bikeshed what the encoding name should be, too.

[1] https://encoding.spec.whatwg.org/#legacy-single-byte-encodin...
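
For anyone unfamiliar with the mechanism: registering a codec is just a matter of giving Python a search function. This toy sketch only aliases cp1252 under a made-up name; the real sloppy- codecs additionally fill the undefined bytes with the Latin-1 control characters:

    import codecs

    def _search(name):
        if name.replace('_', '-') == 'web-windows-1252':  # hypothetical name
            base = codecs.lookup('cp1252')
            return codecs.CodecInfo(base.encode, base.decode, name='web-windows-1252')
        return None

    codecs.register(_search)
    print(b'caf\xe9'.decode('web-windows-1252'))  # café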


Isn't it better to stay out? It seems like it shares similar properties with pytz: new codecs need to be added semi-regularly.

What was it Kenneth Reitz said? Something like: the standard library is where packages go to die.


Yeah, I know the saying.

My observation here is that the number of text encodings is generally decreasing, due to the fact that UTF-8 is obviously good. I want wacky encodings to die. But this is just a class of encodings that have existed for decades and that Python missed. Perhaps on the basis that they were non-standard nonsense, but now they're standardized.

It could be argued that web-windows-1252 is the third most common encoding in the world.

If I'm giving directions for how to decode text in this encoding, it currently only works if you've imported ftfy first, even if you don't need ftfy.


Sounds to me like you've argued yourself around to pitching them for inclusion! I find the argument that web-windows-1252 is supported by modern browsers very convincing.


If they're all 10 years old and on their way out then yeah, I suppose it would make sense to include them in python - whether or not they're nonsense.


Presumably the way to do it would be to add the individual encodings as modules in https://github.com/python/cpython/tree/master/Lib/encodings so the risk of stagnation would be low.

I guess the bigger issue would be bikeshedding the names and aliases...


I'm trying to learn Chinese. I wrote http://pingtype.github.io to parse blocks of text, and I'm now building up a large data set of movie subtitles, song lyrics, Bible translations, etc.

Try reading this in TextWrangler: 1 . 教會組織: 小會:代議長老郭倍宏

The box causes the following characters to be unreadable; it gets interpreted as a half-character. Deleting it makes the text show correctly.

I tried it with ftfy, but it just copied the input through to the output.


Interesting. Here's the output of ftfy.explain_unicode on that text:

    U+0031  1       [Nd] DIGIT ONE
    U+0020          [Zs] SPACE
    U+002E  .       [Po] FULL STOP
    U+0020          [Zs] SPACE
    U+6559  教      [Lo] CJK UNIFIED IDEOGRAPH-6559
    U+6703  會      [Lo] CJK UNIFIED IDEOGRAPH-6703
    U+7D44  組      [Lo] CJK UNIFIED IDEOGRAPH-7D44
    U+7E54  織      [Lo] CJK UNIFIED IDEOGRAPH-7E54
    U+003A  :       [Po] COLON
    U+0020          [Zs] SPACE
    U+F081  \uf081  [Co] <unknown>
    U+5C0F  小      [Lo] CJK UNIFIED IDEOGRAPH-5C0F
    U+6703  會      [Lo] CJK UNIFIED IDEOGRAPH-6703
    U+003A  :       [Po] COLON
    ...
The anomalous character is U+F081, a character from the Private Use Area. TextWrangler is allowed to interpret it as whatever it wants, but I don't know why that would mess up all the following characters.

Here's my theory. The text probably started out in the GBK encoding (used in mainland China). GBK has had different versions, which supported slightly different sets of characters. A number of these characters (decreasing as both GBK and Unicode updated) have no corresponding character in Unicode, and the standard thing to do when converting them to Unicode has been to convert them into Private Use characters.

So that probably happened to this one, which may have started as a rare and inconsistently-supported character.

Python's implementation of GBK (or GB18030) doesn't know what it is. So maybe what we need to do is flip through this Chinese technical standard [1], or maybe an older version of it, and track down which codepoint was historically mapped to U+F081 and what it is now and hahaha oh god

[1] https://archive.org/stream/GB18030-2005/GB%2018030-2005#page...
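
In the meantime, a quick way to spot (or drop) Private Use Area characters like this one, using just the standard library; the string below is reconstructed from the explain_unicode output above:

    import unicodedata

    text = '1 . 教會組織: \uf081小會:代議長老郭倍宏'
    print(['U+%04X' % ord(c) for c in text if unicodedata.category(c) == 'Co'])  # ['U+F081']
    cleaned = ''.join(c for c in text if unicodedata.category(c) != 'Co')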


What character are you trying to use there? I can't see it on my browser either. Can you provide a screenshot of it?


A few months ago I built a simple web interface for ftfy so I don't have to start a Python interpreter whenever I need to decode mangled text: https://www.linestarve.com/tools/mojibake/


Nice. This handles the mangled example I discussed at:

http://www.pixelbeat.org/docs/unicode_utils/


Reminds me of the Universal Cyrillic decoder [1]

An old MySQL db dump I have has some values such as: !Ãƒâ€šÃ‚Â¡!HONDA POW

Does anyone here have an idea if/how I can recover the mangled text?

[1] https://2cyr.com/decode/?lang=en


In fact, ftfy already figures that text out! Here are the recovery steps that the website outputs:

    import ftfy.bad_codecs  # enables sloppy- codecs
    s = '!Ãƒâ€šÃ‚Â¡!HONDA POW'
    s = s.encode('sloppy-windows-1252')
    s = s.decode('utf-8')
    s = s.encode('sloppy-windows-1252')
    s = s.decode('utf-8')
    s = s.encode('latin-1')
    s = s.decode('utf-8')
    print(s)
And the decoded text is (for some reason):

    !¡!HONDA POW
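
In normal use you don't have to spell those steps out; the library's top-level function performs the same search:

    import ftfy
    print(ftfy.fix_text('!Ãƒâ€šÃ‚Â¡!HONDA POW'))  # !¡!HONDA POW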


Thank you, I'd also tested that but it seems to simply remove the mangled string part. Maybe it's impossible to recover it automatically after all :/


No, no. That is the recovered text.

Originally, the text had one non-ASCII character, an upside-down exclamation point. A series of unfortunate (but typical) things happened to that character, turning it into 9 characters of nonsense, the 9th of which is also an upside-down exclamation point.

It looks like ftfy is just removing the first 8 characters, but it's reversing a sequence of very specific things that happened to the text (which just happens to be equivalent to removing the first 8 characters).
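
You can reproduce the forward direction in a couple of lines. This is just an illustration of the kind of chain being reversed; ftfy's reconstruction above used the sloppy- codecs, with Latin-1 for one step:

    s = '¡'
    for _ in range(3):
        # Each round: take the UTF-8 bytes and misread them as Windows-1252 text.
        s = s.encode('utf-8').decode('windows-1252')
    print(s, len(s))  # Ãƒâ€šÃ‚Â¡ 9 -- nine characters, the last of which is ¡ again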


This is awesome. It reminds me of when we decided to add Unicode support to our API, but our code had been connecting to MySQL with a Latin-1 connection. As long as you read from a Latin-1 connection, it looked like everything was correct, but what was actually being stored was the UTF-8 bytes being decoded as a Latin-1 string, and then re-encoded to UTF-8 since the column was UTF-8. Basically:

string.encode("utf-8").decode("latin-1").encode("utf-8")

although technically what mysql calls latin-1 is actually using Windows-1252 :(
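
For anyone in the same spot: once you read the column back over a real UTF-8 connection, the middle step can usually be undone, assuming the bytes survived intact. A sketch (the helper name is made up):

    def undo_double_encoding(s):
        # Reverse of: UTF-8 bytes -> misread as "latin1" (really cp1252) -> re-encoded as UTF-8.
        return s.encode('cp1252').decode('utf-8')

    mangled = 'café'.encode('utf-8').decode('cp1252')  # 'café'
    print(undo_double_encoding(mangled))                # café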


> although technically what mysql calls latin-1 is actually using Windows-1252 :(

...and what mysql calls UTF-8 is a subset that only supports characters whose encodings are at most three bytes! To get real UTF-8 you need to use "utf8mb4". Why anybody uses mysql is beyond me.


This could be a useful web service for interactive use, since all strings will be manually verified.

The underlying ftfy library was previously discussed three years ago (https://news.ycombinator.com/item?id=8187418) and my comments at the time are still relevant:

I can’t help but think that this [library] merely gives people the excuse they need for not understanding this “Things-that-are-not-ASCII” problem. Using this library is a desperate attempt to have a just-fix-it function, but it can never cover all cases, and will inevitably corrupt data. To use this library is to remain an ASCII neanderthal, ignorant of the modern world and the difference of text, bytes and encodings.

Let me explain in some detail why this library is not a good thing:

In an ideal world, you would know what encoding bytes are in and could therefore decode them explicitly using the known correct encoding, and this library would be redundant.

If instead, as is often the case in the real world, the encoding is unknown, there exists the question of how to resolve the numerous ambiguities which result. A library such as this would have to guess what encoding to use in each specific instance, and the choices it ideally should make are extremely dependent on the circumstances and even the immediate context. As it is, the library is hard-coded with some specific algorithms to choose some encodings over others, and if those assumptions do not match your use case exactly, the library will corrupt your data.

A much better solution would perhaps involve machine learning, having the library be trained to deduce the probable encodings from a large set of example data from each user’s individual use case. Even that would occasionally be wrong, but at least it would be the best we could do with unknown encodings without resorting to manual processing.

However, a one-size-fits-all “solution” such as this is merely giving people a further excuse to keep not caring about encodings, to pretend that encodings can be “detected”, and that there exists such a thing as “plain text”.

[…]

I have […] two main arguments:

1. Due to its simplicity for a large group of naïve users, the library will likely be prone to over- and misuse. Since the library uses guessing as its method of decoding, and by definition a guess may be wrong, this will lead to some unnecessary data corruption in situations where use of this library (and the resulting data corruption) was not actually needed.

2. The library uses a one-size-fits-all model in the area of guessing encodings and language. This has historically proven to be less than a good idea, since different users in different situations use different data and encodings, and [the] library’s algorithm will not fit all situations equally well. I [suggest] that a more tunable and customizable approach would indeed be the best one could do in the cases where the encoding is actually not known. (This minor complexity in use of the library would also have the benefit of discouraging overuse in unwarranted situations, thus also resolving the first point, above.)


It's a little strange for you to be criticizing ftfy as an encoding guesser, given that ftfy is not an encoding guesser. Are you thinking of chardet?

> In an ideal world, you would know what encoding bytes are in and could therefore decode them explicitly using the known correct encoding, and this library would be redundant.

Twitter is in a known encoding, UTF-8. Most of ftfy's examples come from Twitter. ftfy is not redundant.

When ftfy gets the input "#╨┐╤Ç╨░╨▓╨╕╨╗╤î╨╜╨╛╨╡╨┐╨╕╤é╨░╨╜╨╕╨╡", it's not because this tweet was somehow in a different encoding, it's because the bot that tweeted it literally tweeted "#╨┐╤Ç╨░╨▓╨╕╨╗╤î╨╜╨╛╨╡╨┐╨╕╤é╨░╨╜╨╕╨╡", in UTF-8, due to its own problems. So you decode the text that was tweeted from UTF-8, and then you start fixing it.

I still think you're thinking of chardet.

> If instead, as is often the case in the real world, the coding is unknown...

...then you will need to detect its encoding somehow. ftfy is now a Python 3-only library; if you try to pass bytes into the ftfy function, the Python language itself will stop you.

Are you hypothesizing that everyone dealing with unmarked bytes is passing them through a chain of chardet and ftfy, and blaming ftfy for all the problems that would result?

Incidentally, I do machine learning. (That's why I had to make ftfy, after all.) I have tried many machine learning solutions. They do not come close to ftfy's heuristics, which are designed to have extremely low false positive rates that are not attainable by ML. If you want one false positive per billion inputs... you're going to need like a quadrillion inputs, or you're going to need a lovingly hand-tuned heuristic.


> ftfy is not an encoding guesser

If it isn’t an encoding guesser, what does it do that "".decode("encoding") doesn’t do?


A guesser answers the question: what encoding did they actually use?

FTFY answers the question: What horrifying sequence of encode/decode transforms could output this sequence of bytes in UTF-8 that, when correctly decoded as UTF-8, still results in total gibberish?

In other words...

The problem fixed by an encoding guesser:

1. I encode my text with something that's not UTF-8-compatible.

2. I lie to you and say it's UTF-8.

3. You decode it as UTF-8 and get nonsense. What the heck?

4. A guesser tells you what encoding I actually used.

5. You decode it from the guessed encoding and get text.

  ---- 
The problem fixed by FTFY:

1. I encode string S with non-UTF-8 codec C.

2. I lie that it's UTF-8.

3. Someone decodes it as UTF-8. It's full of garbage, but they don't care.

4. They encode that sequence of nonsense symbols, not the original text, as UTF-8. Let's charitably name this "encoding" C'.

5. They say: Here teddyh, take this nice UTF-8.

6. You decode it as UTF-8. What the heck?

7. Is it ISO-8859? Some version of windows-X? Nope. It's UTF-8 carrying C', a non-encoding someone's broken algorithm made up on the spot. There's no decoder that can turn your UTF-8 back into the symbols of S, because the text you got was already garbage.

8. FTFY figures out what sequence of mismatched encode/decode steps generates text in C' and does the inverse, giving you back C^-1( C'^-1( C'( C( S )))) = S.


Damn, this is good. I faced a similar issue where a CSV had mixed encodings. At the time I never looked for a library; I read a few SO answers and created an ad hoc Python script to make the file encoding uniform. Ftfy would have made my work simpler.


Nice tool. Just the title is misleading: it's not Unicode that is broken, it's the encoders/decoders.


The Mona Lisa example is delightful!


Someone should add this as an automatic preprocessing filter for Slashdot comments.


Oh man have I got some spreadsheets I’d love to throw this tool at.


No, Unicode text is not broken. It’s either your program that’s broken, for it is interpreting ISO-8859-X/Windows-12XX/whatever as UTF-X; or the program that produced said data.


I wrote the original library. Your statement is true, but in many cases, not useful.

As quoted from the documentation [1]:

> Of course you're better off if your input is decoded properly and has no glitches. But you often don't have any control over your input; it's someone else's mistake, but it's your problem now.

[1] https://github.com/LuminosoInsight/python-ftfy


That's what I think as well. My native language is Bulgarian, and I've been through the whole range you mentioned.

What saves me is having File -> Open with Encoding (Sublime Text). Unfortunately they removed the equivalent from Chrome [1].

[1] https://superuser.com/questions/1160003/how-do-i-change-the-...


For some reason, all examples on that page deal with "sloppy-windows-1252" -> "utf-8".

Is it fair to say it is a sloppy-windows-1252 fixer?


Not when I click on them they don't.



