

Utf8rewind changes for 1.2.1 - knight666
https://bitbucket.org/knight666/utf8rewind/wiki/Changes%20for%201.2.1

======
dantiberian
To take this another step further, you can look at generative, or property
based testing. You specify properties of your input data, e.g. A sequence of 0
or more Unicode characters, and the properties that should hold, e.g. It never
crashes, the length stays the same after processing, e.t.c. Then the generator
will generate ever more complicated test cases to try to break the invariants,
if it does then it will backtrack to find the smallest failing test case.

I find generative tests take a lot longer to think about and write, but are
invaluable for flushing out all the corner cases. There are libraries for lots
of languages, Haskell, Clojure, Python, Erlang, and others too I'm sure.

~~~
seanwilson
This is a good starting point to find such tools:

"QuickCheck is a combinator library originally written in Haskell, designed to
assist in software testing by generating test cases for test suites. It is
compatible with the GHC compiler and the Hugs interpreter.

In QuickCheck the programmer writes assertions about logical properties that a
function should fulfill. Then QuickCheck attempts to generate test cases that
falsify these assertions. The project was started in 1999. Besides being used
to test regular programs, QuickCheck is also useful for building up a
functional specification, for documenting what functions should be doing, and
for testing compiler implementations.[1]

Re-implementations of QuickCheck exist for C,[2][3] C++,[4][5][6] Chicken
Scheme,[7] Clojure,[8][9][10] Common Lisp,[11] D,[12] Elm, [13] Erlang,
F#,[14] Factor,[15] Io,[16] Java,[17][18][19] JavaScript,[20] Node.js,[21]
Objective-C,[22] OCaml,[23] Perl,[24] Prolog,[25][26] Python,[27] R,[28]
Ruby,[29] Rust,[30] Scala,[31] Scheme,[32] Smalltalk,[33] Standard ML[34] and
Swift.[35]"

[https://en.wikipedia.org/wiki/QuickCheck](https://en.wikipedia.org/wiki/QuickCheck)

------
TheLoneWolfling
Alright, this is yet another example of a title change that goes from a good
title to a bad title.

If I hadn't have seen the article before the title was changed, I wouldn't
have read it. "Utf8rewind changes for 1.2.1" isn't exactly the best title.

Please change it back (to "Utf8rewind 1.2.1 adds 750 new unit tests for its 13
functions"). That title actually indicates what is interesting about the
article, and why it was submitted, not just "here's a random changelog for a
random version of a random small library that no-one has ever even seen". HN's
title policy seems to have become worse and worse over time, and this is a
good (bad?) example thereof.

------
joosters
Do they use any fuzzing as part of their testing? UTF-8 parsing seems like an
ideal candidate for this kind of bug hunting.

~~~
knight666
Author here.

The library currently doesn't employ any kind of fuzzing. I rely on multiple
tests for every kind of input. I've found that is a pretty reliable way to
test for "unknown unknowns", even if it's extremely time-consuming.

Adding testcase fuzzing is definitely something to consider though, because it
would most likely have found the very issues this release fixes.

~~~
Ono-Sendai
I recommend trying fuzz testing, it can be very effective (see my experiences
here:
[http://forwardscattering.org/post/21](http://forwardscattering.org/post/21))
Utf8-rewind looks like a nice library, I have looked into it before when
researching unicode support. If I need to do more complex Unicode stuff in my
language I will probably use it.

------
cautious_int
I think you have a bug in codepoint_read(). The last argument is a pointer to
variable unicode_t named decoded. If the first if statement in that function
is true, then the a value is never assigned to that variable, the function
returns, and immediately outside the function the value of that variable is
read. But that variable was never initialized.

[https://bitbucket.org/knight666/utf8rewind/src/c22e458912952...](https://bitbucket.org/knight666/utf8rewind/src/c22e4589129523ae466a3665731cd7cc120233f8/source/internal/codepoint.c?at=175#codepoint.c-180)

[https://bitbucket.org/knight666/utf8rewind/src/c22e458912952...](https://bitbucket.org/knight666/utf8rewind/src/c22e4589129523ae466a3665731cd7cc120233f8/source/utf8rewind.c?at=default#utf8rewind.c-372)

[https://bitbucket.org/knight666/utf8rewind/src/c22e458912952...](https://bitbucket.org/knight666/utf8rewind/src/c22e4589129523ae466a3665731cd7cc120233f8/source/utf8rewind.c?at=default#utf8rewind.c-375)

The prior while statement makes sure the second parameter src_size is never 0,
but the pointer src might be.

Undefined behavior can be caused by passing a NULL pointer as the first
parameter and a non-negative second parameter to utf8toutf16().

~~~
knight666
This is actually guarded against in a macro:

    
    
        #define UTF8_VALIDATE_PARAMETERS(_inputType, _outputType, _result) \
            if (input == 0) { \
                UTF8_SET_ERROR(INVALID_DATA); \
                return _result; \
            } \
    

If a NUL parameter is passed to utf8toutf16 and friends, they'll return
UTF8_ERR_INVALID_DATA.

However, you are correct about codepoint_read's behavior. I actually have a
test for passing NUL to the function, but the value of decoded isn't checked:

    
    
        TEST(CodepointRead, InvalidData)
        {
            const char* i = nullptr;
            size_t il = 0;
            unicode_t o;
    
            EXPECT_EQ(0, codepoint_read(i, il, &o));
        }
    

It's definitely a bug, but relatively harmless. The function is only used
internally and the calling sites guard against that kind of attack.

Nevertheless, I thank you for the extra set of eyeballs. I have been looking
at these functions far too long, you tend to gloss over obvious problems after
a while.

I'll be sure to get a fix in for the next release. :)

------
asveikau

        StreamState stream[4];
        // many lines follow
        memset(stream, 0, 4 * sizeof(StreamState));
    

It bugs the hell out of me to see array bounds and type names repeated like
this. Why not this?

    
    
        memset(&stream, 0, sizeof(stream));
    

Some other nitpicks:

* typedefs ending in _t are reserved by POSIX, though a lot of people ignore this.

* Have you thought of breaking into multiple source files? One benefit is when linking statically you won't pull in the whole thing for just one function. Might read better too, there is a lot going on in that one source file.

~~~
cautious_int
And many C Standard types have a _t suffix. It really shouldn't be used.

There are multiple source files.

~~~
asveikau
> There are multiple source files

Ah you are right, I guess I was looking only at public api wrappers and missed
the "internal" dir. I might add that as further commentary that people might
gloss over a dir called "internal" and miss the bulk of the library. :P

------
cautious_int
Wow, this is interesting. I have never heard of it before. How popular/tested
is this library?

Will try to build it for my project that deals with unicode.

~~~
knight666
You've probably never heard of it because it has 136 downloads in total. ;)

It's very thoroughly tested but it hasn't been used in any serious projects
yet.

------
faragon
Is the Turkish case conversion working?

edit: it seems it is: [https://bitbucket.org/knight666/utf8rewind/pull-
requests/1/t...](https://bitbucket.org/knight666/utf8rewind/pull-
requests/1/turkish-sentence-fixed/diff)

~~~
knight666
No, currently Turkish, Greek, Lithuanian and Azeri case mappings are not
always grammatically correct. This is because utf8rewind currently does not
take the system locale into account when case mapping. Fixing these issues is
planned for a future release.

~~~
faragon
You can consider also set system-independent locale, e.g. "set_turkisk_mode"
(I had that problem, too), etc. I thought that the only case conversion
exception as the Turkish case. Do you remember which cases are an exception
for Greek Lithuanian and Azeri? Also, I know that also German has some non-
bijective cases ("ß" -> SS).

In case you want to save space in tables, you can opt for encoding ranges in
the code, e.g. check sc_tolower()/sc_toupper() into:
[https://github.com/faragon/libsrt/blob/master/src/schar.c](https://github.com/faragon/libsrt/blob/master/src/schar.c)

~~~
knight666
You can find all case folding codepoints on the Unicode Consortium website:
ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt and
ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt.

As you get down the list you'll notice what a pain in the ass the special
cases are. There's a special case for the final sigma in a Greek word:

    
    
        03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
    

You must remove the dot from "i" when upper or titlecasing... but only in
Lithuanian:

    
    
        0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE
    

Etc. etc. By the way, my implementation for case mapping started out similarly
to yours, but I ultimately solved the problem using a binary search in a huge
look-up table:
[https://bitbucket.org/knight666/utf8rewind/src/c22e458912952...](https://bitbucket.org/knight666/utf8rewind/src/c22e4589129523ae466a3665731cd7cc120233f8/source/internal/casemapping.c?at=default#casemapping.c-45)

Unicode case mapping is just a huge mess of exceptions, but that's more the
humans' fault than the standard.

~~~
faragon
Thank you! :-)

------
luchs
Is the documentation available somewhere without having to download it first?

~~~
knight666
It wasn't before, but I looked up how to host static HTML on Bitbucket. You
can now read the documentation online here:

[http://knight666.bitbucket.org/utf8rewind/](http://knight666.bitbucket.org/utf8rewind/)

