
So you think you know what a number is - DamonHD
http://chris.improbable.org/2014/8/25/adventures-in-unicode-digits/
======
badosu
Expecting to see some foundational mathematical logic content, was
disappointed.

~~~
Maken
Too bad, it's your old friend Unicode.

~~~
DoofusOfDeath
No worries, I'm sure the next version of Unicode will have a single glyph that
provides a complete introduction to the topic.

~~~
stcredzero
We need to extend Unicode to account for quantum computing and qubits. It
needs to evolve into Qunicode. Text with this feature baked in could be called
Quneiform. It should be stored on light-reflecting crystals that are
manufactured using kilns.

~~~
clock_tower
You should join the Unicode Consortium; I'm pretty sure you'd fit right in. :)

------
cperciva
Speaking of tools having trouble recognizing things, one of the more common
"bug reports" I get is that the Tarsnap website is disclosing user email
addresses -- specifically, "me@example.com" -- and paths -- specifically, the
paths /home/auser and /home/anotheruser.

------
ianbertolacci

      >>> int("۲۶۷۹")
      2679
    

That's pretty cool.

~~~
Tistron
Indeed. Seems to be a Python 3 thing.

    
    
        $ ruby -v
        ruby 2.2.6p396 (2016-11-15 revision 56800) [x86_64-linux-gnu]
        $ ruby -e "puts \"۲۶۷۹\".to_i"
        0
    
        $ python2.7 -c "print(int(\"۲۶۷۹\"))"
        Traceback (most recent call last):
        File "<string>", line 1, in <module>
        ValueError: invalid literal for int() with base 10: '\xdb\xb2\xdb\xb6\xdb\xb7\xdb\xb9'
    
        $ python3.5 -c "print(int(\"۲۶۷۹\"))"
        2679
    
        $ node -v
        v6.9.1
        $ node -p "parseInt(\"۲۶۷۹\")"
        NaN

~~~
plus
It works in Python 2, you just have to ensure that your string is unicode. I
can't get it to work directly from the command line for some reason, but if
you open an interactive Python session it works:

    
    
      Python 2.7.12 (default, Dec 14 2016, 13:32:53) 
      [GCC 4.9.3] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> print(int(u'۲۶۷۹'))
      2679

~~~
Tistron
Hmm, interesting. What's up with the terminal here? The shell is doing
something to the text.

    
    
        $ python2.7 -c "print(int(u'۲۶۷۹'))"
        Traceback (most recent call last):
        File "<string>", line 1, in <module>
        ValueError: invalid literal for int() with base 10: '\xdb\xb2\xdb\xb6\xdb\xb7\xdb\xb9'
        
        $ python2.7 
        Python 2.7.13 (default, Jan 03 2017, 17:41:54) [GCC] on linux2
        Type "help", "copyright", "credits" or "license" for more information.
        >>> print(int(u'۲۶۷۹'))
        2679
    

Looking at what the string is:

    
    
        $ python2.7 -c "print(list(u'۲۶۷۹'))"
        [u'\xdb', u'\xb2', u'\xdb', u'\xb6', u'\xdb', u'\xb7', u'\xdb', u'\xb9']
    
        $ python2.7 
        Python 2.7.13 (default, Jan 03 2017, 17:41:54) [GCC] on linux2
        Type "help", "copyright", "credits" or "license" for more information.
        >>> print(list(u'۲۶۷۹'))
        [u'\u06f2', u'\u06f6', u'\u06f7', u'\u06f9']

~~~
evincarofautumn
DB B2, etc. are the UTF-8 encodings of U+06F2, etc. So Python is seeing
mojibake: U+00DB (Û), U+00B2 (²), etc. which are not digits. Well, one of them
kinda is, but it’s No (“Number, other”), not Nd (“Number, decimal digit”).
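A minimal Python 3 sketch of that mojibake, assuming the shell hands over UTF-8 bytes and something downstream decodes them as Latin-1. The helper name `categories` is just for illustration:

```python
import unicodedata

persian = "۲۶۷۹"                  # U+06F2 U+06F6 U+06F7 U+06F9
raw = persian.encode("utf-8")      # the bytes the shell passes along
mojibake = raw.decode("latin-1")   # same bytes, read with the wrong codec


def categories(text):
    """Return (code point, general category) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


print(raw)                   # the DB B2 DB B6 ... byte sequence
print(categories(persian))   # four decimal digits (Nd)
print(categories(mojibake))  # eight characters, none of them Nd
```

Since none of the eight mis-decoded characters is in category Nd, int() has nothing it can parse.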

~~~
Tistron
Yeah I get that, but why is that happening?

~~~
evincarofautumn
I’d guess because CPython is assuming the input to -c is ISO-8859-1 (Latin-1)
when it decodes it using Py_DecodeLocale():

    
    
        main()
          …
          setlocale(LC_ALL, "")
          …
          argv_copy[i] = Py_DecodeLocale(argv[i], NULL)
            …
            mbstowcs() or mbrtowc()
            …
          setlocale(LC_ALL, oldloc)
          …
          Py_Main(argc, argv_copy)
    

While the REPL’s encoding (sys.stdin.encoding) is set to UTF-8 due to
LANG/LC_CTYPE settings. You can get the same error when invoking the REPL as:

    
    
        LANG="en_US.iso8859-1" python2.7
    

So the shell isn’t doing anything to the text—it’s providing UTF-8 bytes in
both cases, it’s just that Python is interpreting them differently.

------
Veedrac
Note that Python has str.isdecimal, str.isdigit _and_ str.isnumeric.

    
    
                    isdecimal    isdigit   isnumeric
    
        12345        True        True       True
        ១2߃໔5        True        True       True
        ①²³🄅₅       False       True       True
        ⑩⒓          False       False      True
        Five         False       False      False
    

Use isdecimal if you want to check before calling `int` (though the Pythonic
way is EAFP: just call it and catch the error).
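A quick way to reproduce a table like that yourself, as a rough Python 3 sketch. The sample strings are my own picks and `classify` is a made-up helper name:

```python
def classify(text):
    """Return (isdecimal, isdigit, isnumeric) for a string."""
    return (text.isdecimal(), text.isdigit(), text.isnumeric())


samples = ["12345", "١٢٣", "①²³", "⑩⒓", "Five"]

for s in samples:
    print(s, classify(s))
```

Each predicate only returns True when every character in the string qualifies, so a single odd character flips the whole result.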

~~~
detaro
Don't forget to special case negative numbers though.
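For illustration, a small Python 3 sketch of why the EAFP route sidesteps that special case. `parse_int` is a hypothetical helper, and returning None on failure is just one possible convention:

```python
def parse_int(text):
    """EAFP: let int() decide what is valid, report failure as None."""
    try:
        return int(text)
    except ValueError:
        return None


print("-42".isdecimal())    # the sign is not a decimal character
print(parse_int("-42"))     # int() handles the sign itself
print(parse_int("-۲۶۷۹"))   # the sign also works with non-ASCII digits
print("²".isdigit())        # isdigit() accepts superscripts...
print(parse_int("²"))       # ...but int() rejects them
```

The pre-check and int() disagree in both directions here, which is the argument for just calling int() and catching the error.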

------
Grue3

           int("一万三千二百六十九")
        Traceback (most recent call last):
           File "python", line 1, in <module>
        ValueError: invalid literal for int() with base 10: '一万三千二百六十九'
           int("一三二六九")
        Traceback (most recent call last):
          File "python", line 1, in <module>
        ValueError: invalid literal for int() with base 10: '一三二六九'
    

Well, that was disappointing. For the record, my site
[http://ichi.moe/](http://ichi.moe/) can handle both (and Arabic numerals
too).

~~~
acdha
That actually came up in the comments
([http://chris.improbable.org/2014/8/25/adventures-in-unicode-...](http://chris.improbable.org/2014/8/25/adventures-in-unicode-digits/#comment-3279339258))

Apparently the Unicode consortium chose to exclude those characters because
they weren't encoded in a contiguous sequence so they're defined as
Numeric_Type=Digit rather than Numeric_Type=Decimal. They've apparently
realized that this is not helpful but chose to apply the updated policy only
to new ranges.
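You can inspect what Python's bundled Unicode database says about a given character with the `unicodedata` module. This is only a sketch: `numeric_info` is a made-up helper, and the exact properties depend on the Unicode version your Python ships with. On the builds I'm aware of, the ideograph reports a numeric value but neither a decimal nor a digit value:

```python
import unicodedata


def numeric_info(ch):
    """Return (decimal, digit, numeric) values, None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


for ch in "9۹²一":
    print(ch, numeric_info(ch))
```

int() only accepts characters that have a decimal value, which matches the errors shown above.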

------
gumby
At least AFAIK all the non-CJKV scripts (except perhaps Mongolian?) use
right-to-left decimal characters with 0. So you should be able to
"transliterate" (transnumerate) ۲۶۷۹ to 2679 with a simple lookup table and no
R-L confusion. In fact ۲6۷9 should be the same.
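A sketch of that lookup in Python 3, leaning on the Unicode database as the table instead of writing one by hand. `to_ascii_digits` is a hypothetical name, and passing non-digit characters through untouched is my own choice:

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None if not a decimal digit
        out.append(ch if value is None else str(value))
    return "".join(out)


print(to_ascii_digits("۲۶۷۹"))
print(to_ascii_digits("۲6۷9"))  # mixed Persian and ASCII digits
```

The digits are stored most significant first, so a per-character replacement needs no reordering.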

------
lacker
It's neat how that link lets you run Python interactively in the browser, with
[https://repl.it/site/languages/python3](https://repl.it/site/languages/python3).
Normally that is just a JavaScript thing.

------
phyzome
This applies in Java as well, for instance Integer.parseInt and
Character.isDigit. Between Java and Python, you can get all sorts of numeric
characters through a sizable percentage of web applications.

I haven't yet found a security vulnerability based on this, but I keep
checking. :-)

~~~
acdha
Also .Net:
[https://blogs.msdn.microsoft.com/oldnewthing/20040309-00/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20040309-00/?p=40333)

… and, yes, that was definitely something I tried to find when I first noticed
it.

------
soVeryTired
A partition of the rationals into sets U and V such that every u in U is less
than every v in V?

