
PyPy v7.1 released; now uses UTF-8 internally for Unicode strings - mattip
https://morepypy.blogspot.com/2019/03/pypy-v71-released-now-uses-utf-8.html
======
Animats
Oh, nice.

From PyPy's Twitter feed:

 _" unicode-utf8 just got merged! Unicode strings are now internally
represented as utf-8 in PyPy, with an optional extra index data structure to
make indexing O(1). We'll write a blog post about it eventually."_

Python strings are random-access indexable. UTF-8 has variable-length
characters, so it is not directly randomly accessible. So the Python 3
representation inflates strings to 1, 2, or 4 bytes per character, depending
on the widest character in the string.
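As a rough illustration of that inflation on CPython (the exact byte counts are CPython-specific and vary by version, so treat this as a sketch):

```python
import sys

# CPython picks 1, 2, or 4 bytes per character for the whole string,
# based on the widest character the string contains.
s_ascii = "a" * 100                 # every char fits in 1 byte
s_bmp = "a" * 99 + "\u20ac"         # one euro sign forces 2 bytes/char
s_astral = "a" * 99 + "\U0001f600"  # one emoji forces 4 bytes/char

print(sys.getsizeof(s_ascii))   # smallest
print(sys.getsizeof(s_bmp))     # roughly twice the character data
print(sys.getsizeof(s_astral))  # roughly four times the character data
```

One character at the top of the range is enough to inflate the storage of every other character in the string.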

In practice, most strings are not randomly accessed. Most subscripts are +-1
from a previous subscript, so most accesses can be optimized into uses of
"advance one UTF-8 char" or "back up one UTF-8 char". Hard cases require
generating an index array for the entire string, which takes a full string
scan. Apparently that's what PyPy is doing.
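A minimal sketch of such an index array (hypothetical helper names, not PyPy's actual implementation): a UTF-8 continuation byte always matches the bit pattern 10xxxxxx, so one full scan can record the byte offset of every codepoint, after which lookups are O(1).

```python
def build_index(data: bytes) -> list:
    # One full scan: a byte starts a codepoint unless it is a
    # continuation byte (bit pattern 10xxxxxx).
    return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

def char_at(data: bytes, offsets: list, k: int) -> str:
    # With the index built, fetching codepoint k is O(1).
    end = offsets[k + 1] if k + 1 < len(offsets) else len(data)
    return data[offsets[k]:end].decode("utf-8")

data = "naïve".encode("utf-8")    # 5 codepoints, 6 bytes
offsets = build_index(data)       # [0, 1, 2, 4, 5]
print(char_at(data, offsets, 2))  # ï
```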

This raises the cost of functions like "find()", which search the string for
something and return an integer which is a position in the string suitable for
random access use. I'd once suggested that "find", etc. return an opaque
object which was really a byte index into the string. If you used that in a
subscript, the right thing would happen. If you converted it to an integer,
the index array would have to be computed. Is PyPy doing that, or does the use
of "find" force index generation?

~~~
geophertz
Sorry if I am totally wrong, but could you optimise UTF-8 by using some kind
of marker byte to say "this is a UTF-8 character"? For example:

AÉBéC would give

0x41FF42FF43

where FF would mean "refer to another table with the index of the UTF-8
char", like the following:

|------|------|
|  0   |  1   |
|------|------|
|0xC389|0xC3A9|
|------|------|

This would make random access fast; however, it would also increase overhead
elsewhere.

Also, IIRC UTF-8 leaves some byte values in the 0x80-0xFF range unused. Those
values could be used to refer to an index of the table, e.g. 0x80 = index 0,
0x81 = index 1, ... 0xFE = index 126, and 0xFF = refer to another table.

Regardless of the values over 0x80, the index in the special table would
still have to be maintained when iterating over the string, for find() and
the like.

EDIT: table formatting

~~~
Animats
You can recognize a "UTF-8 character" in a UTF-8 string easily. You can get
the next character, and you can back up to a previous character, all
unambiguously. The encoding is kind of neat. See Wikipedia.
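The unambiguity falls out of the fact that continuation bytes always match the bit pattern 10xxxxxx, so you never confuse them with the byte that starts a character. A sketch with made-up helper names:

```python
def next_char(data: bytes, i: int) -> int:
    """Advance from the start of one character to the start of the next:
    step past the lead byte, then past any continuation bytes (10xxxxxx)."""
    i += 1
    while i < len(data) and data[i] & 0xC0 == 0x80:
        i += 1
    return i

def prev_char(data: bytes, i: int) -> int:
    """Back up to the start of the previous character: step back until a
    non-continuation byte is found."""
    i -= 1
    while i > 0 and data[i] & 0xC0 == 0x80:
        i -= 1
    return i

data = "Aé€".encode("utf-8")  # starts: A at 0, é at 1, € at 3
print(next_char(data, 1))     # 3 (skips é's two bytes)
print(prev_char(data, 3))     # 1 (backs up over them)
```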

------
hermitdev
Curious, anyone know why 64-bit Windows isn't supported? Is it because a long
is only 32 bits?

I'd potentially be willing to help with the development to get that working.
My current day job is at a 64-bit Windows shop in finance, and we have
several teams expanding their use of Python with very large datasets. A
64-bit PyPy on Windows could be big for us.

~~~
mattip
Here is an explanation and a suggestion on how to implement it.

[http://doc.pypy.org/en/release-2.6.x/windows.html#what-is-
mi...](http://doc.pypy.org/en/release-2.6.x/windows.html#what-is-missing-for-
a-full-64-bit-translation)

Please help us; we are a small team. The best way to reach us is through IRC,
on the #pypy channel.

------
spenrose
Great work, thank you so much! I was looking for PyPy vs C-Python 3.x speed
comparisons on the site and not finding any. Do you publish them?

~~~
mattip
Yeah, about that. Speed.pypy.org is having problems updating; the
seven-year-old fork of codespeed needs refreshing. The benchmarks there are
Python 2, so we would need to rework them for Python 3.

In the meantime, the Python perf and performance packages are driving
speed.python.org. I started trying to get PyPy3 runs reported there, but we
need to provide a warmup parameter for each benchmark, so that too will take
some time.

The best benchmark, of course, is your use case: try it out and let us know
where we are slow.

------
hsivonen
Is there a write-up of the details of the PyPy UTF-8 Unicode string internals?

~~~
mattip
Not really. The UnicodeObject has utf8 bytes, a length in codepoints, and an
optional index map that is only calculated if needed to traverse the string.
Most strings won't need the map. String matching is utf8-aware, and indexes
by codepoint.
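A toy sketch of that layout (a hypothetical class, not PyPy's actual code): the index map is built lazily, only the first time someone indexes by an arbitrary codepoint.

```python
class Utf8Str:
    """Sketch: utf8 bytes, a codepoint length, and an index map that is
    only built on first random access. Positive indexes only."""

    def __init__(self, s: str):
        self._utf8 = s.encode("utf-8")
        self._length = len(s)   # length in codepoints
        self._index = None      # built only if needed

    def __len__(self):
        return self._length

    def __getitem__(self, k: int) -> str:
        if self._index is None:
            # Full scan: byte offset of every codepoint start
            # (continuation bytes match the pattern 10xxxxxx).
            self._index = [i for i, b in enumerate(self._utf8)
                           if b & 0xC0 != 0x80]
        start = self._index[k]
        end = (self._index[k + 1] if k + 1 < self._length
               else len(self._utf8))
        return self._utf8[start:end].decode("utf-8")

s = Utf8Str("naïve")
print(len(s))  # 5, known without any scan
print(s[2])    # ï, triggers the one-time index build
```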

The main speedup is in converting unicode to internal representation once,
then passing around the object with no further conversions.

Not sure that answered the question though.

------
therealmarv
Why is PyPy not just skipping the current Python 3.6 effort and going
directly to 3.7, or at least doing that once in a while (skipping in-between
versions)?

~~~
pritambaral
I'd guess because that would require implementing the 3.6 features anyway.

------
ChrisSD
So does this mean that RPython is now UTF-8 only?

~~~
mattip
No. RPython still supports unicode strings. It now also supports utf8 string
classes for things like regular expressions and FFI calls. PyPy just uses the
non-unicode RPython APIs.

~~~
akvadrako
UTF-8 is Unicode - your post doesn’t make much sense.

~~~
simcop2387
UTF-8 is one way to encode Unicode code points. There are also UTF-16LE,
UTF-16BE, UTF-32LE, UTF-32BE, and others like UTF-7. This change makes PyPy
use UTF-8 internally for the memory and speed savings, at the cost of some
time later on when you use certain features that require the string index to
be built.
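For a feel of the size difference between those encodings (standard library only, using an arbitrary example string):

```python
s = "héllo"  # 5 codepoints, one of them non-ASCII
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(s.encode(enc)))
# utf-8 -> 6 bytes, utf-16-le -> 10 bytes, utf-32-le -> 20 bytes
```

For mostly-ASCII text, UTF-8 comes close to one byte per character, which is where the memory savings come from.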

------
truth_seeker
Is anybody using pypy for big data and machine learning tasks ?

