
Unicode Regular Expressions - kawera
http://unicode.org/reports/tr18/
======
gok
Nice, two pieces of technology which are always implemented incorrectly
combined into a single standard.

------
glangdale
This is not really a standard; nearly any behavior which is complicated is
left undefined or expressed with vague, aspirational language ("an
implementation _should_ do this somehow, I dunno"). It doesn't look like regex
implementers were consulted at any time.

~~~
burntsushi
As a regex implementor, I am super appreciative of this document. I've
consulted it quite a bit, and it helped guide me towards picking and choosing
which Unicode features I should support and what their semantics should be.
The upshot is that many other regex engines have either inspired this document
or been inspired by it, which gives at least some consistency between
implementations.

I also think it's clear from reading this document many times that it was very
much informed by regex implementors. The text seems quite self-aware of the
fact that some features may not be practical to implement in all scenarios.

It is certainly far from perfect, and I bet that if we got in a room together,
we could vent about things in (or missing from) this document. But I am
determined not to let perfection be the enemy of good, and this document has,
at least in my experience, done much more good than harm.

~~~
jonstewart
Ditto to everything here. UTS #18 is a fantastic map for the implementor. I
co-authored a paper about it:
[http://www.dfrws.org/sites/default/files/session-files/paper...](http://www.dfrws.org/sites/default/files/session-files/paper-unicode_search_of_dirty_data_or_how_i_learned_to_stop_worrying_and_love_unicode_technical_standard_18.pdf)

~~~
glangdale
I still don't agree, although I respect both of your opinions as real regex
implementors without a doubt.

It's just too complex a 'standard'. The amount of crap they had to 'retract'
by the current version is telling, and features like 3.9 and 3.11 are
speculative design by standards committee, which is the exact opposite of how
these things should work. It's the ultimate unfunded mandate: "hooray, I'm
writing a standard, why don't I shove some random crap in there that no-one
implements now in existing libraries in the hope I get to set directions".

The standard should be _minimal_ and _well-defined_. This is neither.

It's also preposterously self-important. Yeah, I'm so fascinated by the
history of the Unicode standard I'm going to start introducing an "age"
property so I can regex-match on when characters were entered. There's a
certain attitude whereby these guys are just blithely shoveling junk into a
standard that no-one needs. If you're desperate for all these exotic character
properties, put them in a goddamn file as a character class and blat them in
wherever you need them (yes, it would be nice if regex had macros).
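
A minimal sketch of that "put it in a file" approach, assuming only Python's standard `re`, `sys`, and `unicodedata` modules. `char_class_for` is a hypothetical helper name, not part of any library, and the generated class reflects whichever Unicode version ships with the interpreter:

```python
import re
import sys
import unicodedata


def char_class_for(predicate):
    """Return a regex character class covering every code point for which
    predicate(char) is true. Hypothetical helper, for illustration only."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if predicate(chr(cp)):
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))

    # Emit \UXXXXXXXX escapes so the generated text is plain ASCII and
    # can be saved to a file and pasted into other patterns.
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append('\\U%08X' % lo)
        else:
            parts.append('\\U%08X-\\U%08X' % (lo, hi))
    return '[' + ''.join(parts) + ']'


# Example: every character whose General_Category is Nd (decimal number).
DECIMAL_NUMBER = char_class_for(lambda ch: unicodedata.category(ch) == 'Nd')
decimal_run = re.compile(DECIMAL_NUMBER + '+')
```

The string in `DECIMAL_NUMBER` can be spliced into any larger pattern, which is about as close to a regex macro as plain string concatenation gets.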

Somewhere in all this nonsense is a good standard struggling to get out. They
need to shit or get off the pot. Anything that they don't know how to do or
that no-one knows how to do should be completely deleted. Anything that they
offer 15 different choices as to how to do ... they need to figure out which
is the best one and pick that. It's a standards document, it's supposed to
make choices, it's not a potted plant.

~~~
jonstewart
It has a bad case of featuritis; it reads both discursively and self-
consciously, like it knows that there's really no justification for using
regexps to specify named properties, but it just can't help itself.

Some of the Level 3 features venture too far afield from Unicode, too, e.g.,
3.11 is "implement YACC" and has nothing to do with Unicode and 3.7 is
"implement a streaming regexp engine, we dare you" (challenge accepted,
Unicode dweebs!).

So, it is a horrible standard by which to judge overall conformance, but no
one really does that. I think the value is in lighting the way through a broad
array of thorny Unicode topics. And I think that the implementor who grapples
with a majority subset of the sensible topics will wind up with an engine that
handles users' Unicode needs.

------
rspeer
If you want to work with these in Python, use the "regex" package. [1] Not to
be confused with the "re" module in the standard library.

Also, maybe there should be a (2016) or even a (2004) in the title! (2016 is
when this document was last revised, and 2004 is when it became a Unicode
standard.)

[1] [https://pypi.org/project/regex/](https://pypi.org/project/regex/)

~~~
jwilk
I don't think the regex package implements this. At least the hex notation is
not supported:

    
    
      >>> import regex
      >>> regex.compile(r'\u{3040}')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/lib/python3/dist-packages/regex.py", line 345, in compile
          return _compile(pattern, flags, kwargs)
        File "/usr/lib/python3/dist-packages/regex.py", line 507, in _compile
          caught_exception.pos)
      _regex_core.error: incomplete escape \u at position 2

~~~
rspeer
Hmm, okay. It seems to just use Python's unicode escape syntax, instead of
what the standard says. But the package does support things like the complex
character classes, and identifying word boundaries.

(The algorithm for identifying word boundaries in most languages without
ASCII/English assumptions is quite complex and useful, as I can say from
having tried to reinvent half of it before learning about the standard and the
regex package. It's not a panacea -- it won't do anything useful with
languages where word boundaries require lexical knowledge, like Chinese,
Japanese, and Thai -- but other than that it handles all the edge cases you
never would have thought of.)
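
To illustrate the kind of edge case meant here, a small sketch using only Python's standard `re` module, which is the naive baseline and not the Unicode algorithm. As far as I understand the default Unicode word-boundary rules, they keep an apostrophe between letters and a full stop between digits inside one word, whereas `\w+` splits on both:

```python
import re

# \w+ treats anything that is not a letter, digit, or underscore as a
# separator, so the apostrophe and the decimal point both split words.
naive_words = re.findall(r'\w+', "can't stop")
naive_number = re.findall(r'\w+', "3.14")

print(naive_words)   # ['can', 't', 'stop']
print(naive_number)  # ['3', '14']
```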

~~~
burntsushi
Yeah, basically, the standard is a little sneaky on this point. It doesn't
actually require a specific syntax, only that it be _possible_ to write
Unicode codepoints in hexadecimal. Notice that it says, "... shall supply _a
mechanism_ ..." rather than "this mechanism."

Of course, it's probably a good idea to follow the sample syntax provided. :-)
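
A rough sketch of what "a mechanism" can look like in practice, assuming only Python's standard `re` module (which accepts `\uXXXX` and `\UXXXXXXXX` in patterns). `expand_braced_hex` is a hypothetical helper, not part of `re` or `regex`, that rewrites the braced sample syntax into the form `re` understands:

```python
import re

# Python's own escapes already provide a way to write hex code points.
hiragana_a = re.compile(r'\u3042')      # exactly 4 hex digits
grinning = re.compile(r'\U0001F600')    # exactly 8 hex digits

# Match either an escaped backslash (left alone) or \u{X...} with 1-6
# hex digits (rewritten).
_BRACED_HEX = re.compile(r'(\\\\)|\\u\{([0-9A-Fa-f]{1,6})\}')


def expand_braced_hex(pattern):
    """Rewrite \\u{X...} escapes as \\UXXXXXXXX. Illustration only."""
    def repl(match):
        if match.group(1):              # escaped backslash: keep as is
            return match.group(1)
        return '\\U%08X' % int(match.group(2), 16)
    return _BRACED_HEX.sub(repl, pattern)


braced = re.compile(expand_braced_hex(r'\u{3042}+'))
```

Values above U+10FFFF are passed through to `re`, which rejects them when the pattern is compiled.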

------
lucio
Hearing "Unicode Regular Expressions" makes me think of some other "fantastic"
technology combinations:

\- XML configuration for Windows DLL dependency

\- CORBA aware DST Library

~~~
speps
> \- XML configuration for Windows DLL dependency

Oh boy! You're in luck today, you'll learn about side-by-side assemblies and
application manifests:
[https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...](https://msdn.microsoft.com/en-us/library/windows/desktop/aa375144%28v=vs.85%29.aspx)

------
ktpsns
This will make a difference in how we write regexes. Look at the level 1
properties:

    
    
       [\p{letter} \p{decimal number}]
    

And compare for instance to the POSIX regexp character classes (a reference:
[https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basi...](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions#Character_classes))

    
    
      [[:alpha:] [:digit:]]
    

Given that the Unicode standard goes much further, proper implementations
should help us write regexps for the modern web and the problems we typically
face there. It's a pity the standard does not go into detail there, though
(for instance, on similarity of characters).
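
For a concrete feel of the difference, a small sketch in Python using only the standard `re` and `unicodedata` modules. The standard `re` has no `\p{...}`, so `is_letter_or_decimal` is a hypothetical stand-in that checks the General_Category of a single character:

```python
import re
import unicodedata

# For str patterns, \d matches any Nd character in Python 3;
# re.ASCII restricts it to the POSIX-like [0-9].
unicode_digit = re.compile(r'\d')
ascii_digit = re.compile(r'\d', re.ASCII)


def is_letter_or_decimal(ch):
    # Rough equivalent of the class [\p{letter}\p{decimal number}]
    # applied to one character: any L* category, or Nd.
    category = unicodedata.category(ch)
    return category.startswith('L') or category == 'Nd'


arabic_three = '\u0663'   # ARABIC-INDIC DIGIT THREE
print(bool(unicode_digit.fullmatch(arabic_three)))  # True
print(bool(ascii_digit.fullmatch(arabic_three)))    # False
print(is_letter_or_decimal('\u3042'))               # True (HIRAGANA LETTER A)
```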

~~~
JadeNB
> This will make a difference how we write regexes. Look at the level 1
> properties:

> [\p{letter} \p{decimal number}]

> And compare for instance to the POSIX regexp character classes

> [[:alpha:] [:digit:]]

No need for the future tense; Perl's had that at least since v5.12:
[http://perldoc.perl.org/5.12.0/perldelta.html#Unicode-overha...](http://perldoc.perl.org/5.12.0/perldelta.html#Unicode-overhaul) .

------
paulpauper
Is there a limit to how complicated and ornate a character can be, besides
pixels? It's amazing that Chinese characters can be rendered so well across
many browsers and websites.

~~~
squiggleblaz
It depends on the font format. If TrueType doesn't let you code something, you
could use an SVG font etc. etc. Almost all fonts in the wild are
outline/scalable, so that imposes some limitations.

