
Unicorn: C++ Unicode string library - captaincrowbar
https://github.com/CaptainCrowbar/unicorn-lib
======
aurelian15
Looks like a nice project. I'm currently searching for a Unicode library and
it appears to me that ICU is the de-facto standard here, which has the benefit
of coming pre-installed on pretty much any Linux distribution. Any reason why
I should use Unicorn instead? I couldn't find information on how it compares
to ICU in the documentation (well, except for the most welcome usage of modern
C++).

~~~
rspeer
It looks like Unicorn can apply operations (such as regexes) to text that is
natively in UTF-8, giving it a distinct advantage over ICU, which was written
back when UTF-16 seemed like a good idea and has to convert everything into
UTF-16.

~~~
fantasticfears
It's hard but necessary to differentiate between UTF-16 and a UChar byte
array: a UChar byte array is not necessarily a well-formed UTF-16 string.
Besides, why not just use UnicodeString? It's fairly easy to use, and it hides
those details from you.

It's indeed super cool to see a modern Unicode C++ library. But is it really
useful for production? The answer could be no. In contrast, ICU is old,
compact, and battle-tested.

~~~
rspeer
I'm talking about using UTF-8 as the string representation, not UChars. UChars
are an artifact of UTF-16, and thus require converting all text on input and
output, unless you work in a Windows API world where I/O is UTF-16.

Modern programming languages such as Rust gain efficiency by working with
unmodified UTF-8. All you lose is constant-time arbitrary indexing, which is a
bad idea in most cases anyway.

~~~
jstimpfle
Why is it a bad idea? Because Unicode semantics are too complicated to split a
Unicode string at arbitrary points?

~~~
imron
Both utf8 and utf16 can contain multicharacter elements. If you split a string
at an arbitrary point you risk splitting it inside a multicharacter element.

This will be very common in utf8 that contains non-ascii characters, and very
rare with utf16 (only happens with characters outside the BMP).

Neither is something you want in your code, unless you think it's a good idea
to corrupt your users' data.

Edit: It's not too difficult to handle these cases and make sure you only
split at valid positions, but you do need to be careful and there are a number
of edge cases you might not think through or even encounter unless you have
the right sort of data to test with - which leads to lots of faulty
implementations. e.g. for years MySQL couldn't handle utf8 characters outside
the BMP.

~~~
jstimpfle
My parent was speaking about indexing at the _code points_ level, not at the
encoding (byte / character) level.

I do know that Unicode has _combining code points_ (confusingly called
combining characters) and nasty things like rtl switching code points. I guess
it's turtles all the way down.

~~~
vardump
> My parent was speaking about indexing at the code points level, not at the
> encoding (byte / character) level.

You need UTF-32 for (random) indexing of code points. UTF-16 has 16-bit _code
units_. Some UTF-16 _code points_ are 32-bits, using a surrogate pair.

So it's the same trade-off as with UTF-8. Thus no reason not to just simply
use UTF-8 in the first place and take advantage of the memory savings.

~~~
jstimpfle
Again, my original parent's statement was not about encoding or memory
savings. The statement was that it was a bad idea to index into an (abstract)
unicode string (of unicode code points -- not compositions thereof
whatsoever).

I didn't question that, but hoped to get some inspiration for sane Unicode
handling (which I'm not sure is humanly possible, except by treating it as a
rather black box and making no promises).

~~~
imron
Your original parent's comment was all about encodings, and it mentioned that
it was a bad idea to arbitrarily index into utf8 strings (no mention of
abstract strings of unicode codepoints).

> _languages such as Rust gain efficiency by working with unmodified UTF-8.
> All you lose is constant-time arbitrary indexing_

So it's saying Rust mostly benefits from using utf8, but in doing so, it loses
the ability to arbitrarily index a character in a string (in constant time).

If it were abstract strings of unicode codepoints then there would be no
problem - except you'd then be using 32 bits per codepoint.

------
cmrdporcupine
The unicode portion looks reasonable, but why is it necessary for it to
include its own flags, file io, file management, and environment classes?

Why is it that so many C++ libraries fall into the habit of trying to build
one big framework? I'm perfectly happy with gflags -- a unicode library would
be nice for my project, but now I won't consider this library.

~~~
captaincrowbar
Because the whole point is to handle anything that needs Unicode support. A
library that only manipulated Unicode strings would be incomplete if you still
couldn't use Unicode in command line options, file names, etc.

~~~
cmrdporcupine
I would recommend breaking them off into separate additional libraries. I
don't need unicode for flags, so paying for it at compile and link time seems
unwise. Or provide adapter classes that can be used over other frameworks.
Just a suggestion.

------
vidoc
Seems like the word 'Unicorn' is currently _the_ buzzword of 2016 in tech!

------
maaku
Your GitHub pages break the back button.

~~~
captaincrowbar
No idea what you mean, sorry. I'm just using Github's automatically generated
web pages, so if there's a problem there it's probably a Github issue.

~~~
geekone
He's probably referring to the Documentation link you provide on the GitHub
page; it breaks the back button for me too.

~~~
captaincrowbar
I just tried that on several browsers; Safari and Chrome are fine, it seems to
be only Firefox that has a problem with that. I have no idea whether that's a
bug in Firefox or Github, and either way there's nothing I can do about it,
sorry.

~~~
funkaster
Yes, you can: publish your docs as real web pages rather than a link to the
htmlpreview of a file inside your repo. That should fix the problem.

~~~
dpark
I guess he should have said that there's nothing _reasonable_ he can do about
it. Creating an entirely separate set of HTML pages would require a new
publishing flow, add a new step every time docs update, and generally
encourage the docs to fall out of sync with the repo. He could do all of this,
or he could do the sensible thing and leave the docs exactly like they are.

------
xjia
Can anyone compare this to Boost.Nowide?

