
Obscure Python libraries for data science - jhibbets
https://opensource.com/article/18/11/python-libraries-data-science
======
cosmie
If you're doing anything with text, ftfy[1][2] is an oasis of sanity in a
world filled with torment. No data source is safe from having its encoding
butchered as it gets tossed through dozens of hops, from databases to
applications to data pipelines and eventually to you. Each of those hops makes
whatever default encoding assumptions are most convenient or backwards
compatible for itself, unintended data mutations be damned.
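
For anyone who hasn't been bitten yet, here's a rough stdlib-only sketch of how
one of those bad assumptions mangles text, and how a single layer of it can be
reversed. This is not how ftfy works internally; ftfy's fix_text() detects and
handles far more cases on its own. The helper name undo_one_layer and the
choice of latin-1 as the "wrong" encoding are just assumptions for
illustration.

```python
# Illustration only: one hop writes UTF-8, the next hop assumes Latin-1.
original = "café naïve"

raw = original.encode("utf-8")      # what the first system stored
garbled = raw.decode("latin-1")     # what the second system thinks it read
print(garbled)                      # cafÃ© naÃ¯ve


def undo_one_layer(text, wrong="latin-1", right="utf-8"):
    """Reverse a single wrong-decoding step (hypothetical helper).

    Re-encode with the encoding that was wrongly assumed, then decode with
    the encoding the bytes were really in. If either step fails, the text
    was probably not garbled this way, so return it unchanged.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(undo_one_layer(garbled))      # café naïve
```

The catch is that you have to already know which two encodings were involved
and how many times the mistake was stacked, which is exactly the guesswork a
library like ftfy takes off your hands.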

Once you've zapped as much mojibake[3] from your data as possible, follow it
with a pass through csvclean[4] so you know your data is delimited and escaped
_exactly_ how you expect it to be and can be processed and ingested with
confidence. Then, when you need to cram it back into a legacy system that only
supports ASCII, unidecode[5] for the win. And every now and then
transliterate[6] comes to the rescue for the odd need.
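
To make those two steps concrete, here's a minimal stdlib-only sketch of the
kind of checks involved. It is a stand-in, not the real tools: ragged_rows and
ascii_fold are hypothetical helpers, csvclean itself is a command-line program
that does more than count fields, and unidecode actually transliterates
characters where this naive version just drops whatever it can't decompose.

```python
import csv
import io
import unicodedata


def ragged_rows(csv_text):
    """Return (line_number, field_count) for rows that don't match the header width."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=2)
        if len(row) != len(header)
    ]


def ascii_fold(text):
    """Naive ASCII fallback: split off accents, then drop anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = 'id,name\n1,"Smith, J"\n2,Jones,extra\n'
print(ragged_rows(sample))      # [(3, 3)]  row 3 has 3 fields, header has 2

print(ascii_fold("café naïve")) # cafe naive
print(ascii_fold("Łódź"))       # odz  (the Ł has no decomposition, so it vanishes)
```

That last line is the reason to reach for a proper transliteration library:
silently losing letters is rarely what the legacy system's users want.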

[1]
[https://ftfy.readthedocs.io/en/latest/](https://ftfy.readthedocs.io/en/latest/)

[2] The main fix_text() function is by far the crown jewel. But there are
quite a few handy helper functions in the library that don't get wrapped into
fix_text() and have to be called independently when desired.

[3]
[https://en.wikipedia.org/wiki/Mojibake](https://en.wikipedia.org/wiki/Mojibake)

[4]
[https://csvkit.readthedocs.io/en/1.0.3/scripts/csvclean.html](https://csvkit.readthedocs.io/en/1.0.3/scripts/csvclean.html)

[5] [https://github.com/iki/unidecode/](https://github.com/iki/unidecode/)

[6]
[https://pypi.org/project/transliterate/](https://pypi.org/project/transliterate/)

