
Unitools – A suite of tools for working with Unicode in the browser - causality
https://www.unicod.es/
======
LeifCarrotson
These tools are exactly why ASCII is not dead.

Unicode is ideal for storing text exactly as the user wanted it. Which may be
crazy.

But when I am writing a program for internal use, or creating a communication
protocol, or trying to parse one program's output into another, or trying to
make a 7-segment display show some characters, I don't want to have to handle
these crazy possibilities. Just text is fine, thank you!

~~~
vmorgulis
> Just text is fine, thank you!

And Unicode has a non-negligible footprint.

I tried a few times to handle that in C or C++, and everything became
complicated (even with a library like ICU).

Each time, I had to ask myself:

- What's the encoding?
- How to detect it?
- What's the collation?
- ...

I read that Java strings are switching to ASCII (or 16-bit) because it's too
inefficient.

~~~
jstimpfle
I think most software (like programming languages) has actually switched to
UTF-8 for internal representation, because it has a low footprint (for most
applications).

You give up efficient O(1) indexing, but I don't think you typically need
that. Indexing is far less meaningful for Unicode compared to ASCII.
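
A small illustration of why indexing means less in Unicode, as a sketch only. It assumes Python 3 and uses the standard `unicodedata` module; the variable names are made up for the example. The same visible character "é" can be one code point or two, so "the character at index i" depends on normalization.

```python
import unicodedata

# "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Three different "lengths" for what a reader sees as one character.
decomposed_code_points = len(decomposed)              # code points
decomposed_bytes = len(decomposed.encode("utf-8"))    # UTF-8 bytes
composed_code_points = len(composed)                  # code points after NFC

# Indexing the decomposed form returns only the base letter.
first = decomposed[0]
```

Here `first` is a plain "e" with the accent left behind, which is the kind of result that makes index-based access less useful than iterating over the text.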

~~~
vmorgulis
> ... but I don't think you typically need that ...

Yes. String processing often occurs in a loop (like for trim, contains, split
...).

~~~
jstimpfle
trim, contains, split (at whitespace, or at a given _byte_ position) can all
be implemented efficiently in UTF-8.
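
A rough sketch of that point, assuming Python 3 and valid UTF-8 input. The helper names (`utf8_trim`, `utf8_split_whitespace`, `utf8_contains`, `utf8_split_at`) are invented for the example. It works on raw bytes because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so ASCII whitespace and a valid UTF-8 needle can only match at character boundaries.

```python
def utf8_trim(data):
    # bytes.strip() removes ASCII whitespace only; it will not remove
    # non-ASCII spaces such as U+00A0.
    return data.strip()


def utf8_split_whitespace(data):
    # bytes.split() with no argument splits on runs of ASCII whitespace.
    return data.split()


def utf8_contains(haystack, needle):
    # Plain byte substring search; no decoding needed.
    return needle in haystack


def utf8_split_at(data, byte_pos):
    # Safe only if byte_pos is a character boundary, e.g. an offset
    # returned by a search for a valid UTF-8 needle.
    return data[:byte_pos], data[byte_pos:]


text = "  naïve café, 日本語 text  "
data = text.encode("utf-8")

trimmed = utf8_trim(data)
words = [w.decode("utf-8") for w in utf8_split_whitespace(data)]
pos = trimmed.find("café".encode("utf-8"))  # byte offset, not character index
head, tail = utf8_split_at(trimmed, pos)
```

Note that `pos` is a byte offset: "ï" takes two bytes, so the offset is one larger than the character index of "café" in the trimmed string.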

------
peterkelly
An article I always recommend on this topic:

"The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky

[http://www.joelonsoftware.com/articles/Unicode.html](http://www.joelonsoftware.com/articles/Unicode.html)

------
david-given
Getting Unicode wrong can kill.

[http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two...](http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail)

> The use of "i" resulted in an SMS with a completely twisted meaning: instead
> of writing the word "sıkısınca" it looked like he wrote "sikisince." Ramazan
> wanted to write "You change the topic every time you run out of arguments"
> (sounds familiar enough) but what Emine read was, "You change the topic
> every time they are fucking you"...

~~~
peterkelly
From the article, it sounds as if that wasn't the only factor...

------
teach
Despite the tongue-in-cheek tagline on the site, these _are_ some really neat
tools.

~~~
kdkooo
I agree! It's always disappointing when an invitation to contest a
controversial statement outweighs the information they were actually trying to
relay. Would love to see more comments and discussion on clever applications
of some of these tools.

------
coldcode
Heck even EBCDIC isn't dead. Nothing dies forever as much as you'd like it to.

------
mhuffman
Embedded systems would like to talk with you ...

------
vmorgulis
It doesn't translate "é" to "&eacute;" (it gives "&#233;" instead).

~~~
ygra
It generally changes characters to numerical character references, not HTML
character entities (which, incidentally, are a rather limited set of
characters and pretty much useless these days – they were useful to include a
few non-ASCII latin characters from Latin 1 in the mid-90s into HTML pages
with browser charset detection messing things up if included directly; by now
you just use Unicode and forget that a named set of entities ever existed).

~~~
vmorgulis
>...by now you just use Unicode and forget that a named set of entities ever
existed...

OK.

When I was working in PHP, I often used "htmlspecialchars()" and it does this
kind of transform. That's why I tried that.

[http://php.net/manual/en/function.htmlspecialchars.php](http://php.net/manual/en/function.htmlspecialchars.php)

I had some problems too with "&nbsp;" because it's 160 in ASCII and not a
whitespace (32).

Edit: 160

~~~
ygra
The non-breaking space is certainly not 256 in ASCII (ASCII only extends to
127) and not even in Latin 1 (goes up to 255) and also not in Unicode – it's
160. While you can use &nbsp; in HTML you can also use &#160; or &#xA0;, both
of which work in XML as well. And since numeric and hexadecimal character
references are much easier to generate (you don't need a lookup table) I'd say
they're a far nicer choice these days.
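
To show the "no lookup table" part, here is a minimal sketch in Python 3. The function name `to_numeric_refs` and its `hexadecimal` parameter are made up for the example, and this is not how the linked tool is implemented. It only rewrites non-ASCII characters and does not escape `&`, `<` or `>`.

```python
import html


def to_numeric_refs(text, hexadecimal=False):
    """Replace every non-ASCII character with a numeric character reference."""
    out = []
    for ch in text:
        cp = ord(ch)  # the code point is all that is needed
        if cp < 128:
            out.append(ch)
        elif hexadecimal:
            out.append(f"&#x{cp:X};")
        else:
            out.append(f"&#{cp};")
    return "".join(out)


decimal_ref = to_numeric_refs("é")
hex_ref = to_numeric_refs("a\u00a0b", hexadecimal=True)

# The standard library can produce the decimal form as well.
stdlib_ref = "é".encode("ascii", "xmlcharrefreplace").decode("ascii")

# Named and numeric references decode to the same character.
same = html.unescape("&eacute;") == html.unescape("&#233;")
```

Going the other way, a named entity like `&eacute;` does need a table of names, which `html.unescape` carries internally.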

~~~
vmorgulis
> ... in HTML you can also use &#160; or &#xA0 ...

Yes! You're right.

>... I'd say they're a far nicer choice these days.

It was a long time ago, for a scraping project (garbage HTML).

------
alexfisher
ASCII 4 Life!

