
Fast and formally verified C implementation of Base64 - lelf
http://corp.galois.com/blog/2013/9/11/high-assurance-base64.html
======
eru
The people at Galois do so many fascinating things. I recommend watching some
of the talks they gave at each year's ICFP / CUFP. (Pick any, or all, if you
have the time.)

------
afhof
"Beware of bugs in the above code; I have only proved it correct, not tried
it."

~~~
sigstoat
spewing this quote with no context or further comment is just tedious FUD.

~~~
jacquesm
It serves as a reminder that even though this implementation has been proven
correct that does not rule out bugs in the proof or in the method used to
create the proof.

~~~
gsnedders
Or that your compiler produces code equivalent to the (proven) correct C.

------
jstanley
I like the Duff's device-like loop entry:
[https://github.com/davidlazar/base64/blob/master/base64encod...](https://github.com/davidlazar/base64/blob/master/base64encode.c#L35)

~~~
cbsmith
I wouldn't even really call that a Duff's-device-like entry. It's a pretty
standard implementation of a state machine...

I'm still really scratching my head questioning to what extent you can really
consider this implementation a "coroutine" solution. It sure looks to me like
a state machine solution.

------
randomvisitor
It's pretty impressive that this can be done completely automatically (model
checking), although the price to pay is that he can only prove the
implementation for a fixed input size.

I'm very curious about what would happen if he dropped the safety check in
base64_encode_value(). It would probably speed things up a little, but would
the proof fail?

~~~
cbsmith
Yeah, I thought the safety check was rather curious really. It was also odd to
use a NUL-terminated string for the encoding set instead of a straight
64-byte array (sure, you only save 1 byte, and most compilers will no doubt
waste that byte ANYWAY just for memory alignment reasons, but it still struck
me as surprisingly... imprecise).

------
waffenklang
Good example of the weaknesses of the 'formally verified' sales pitch. It
may be formally correct, but the code is full of weaknesses: no NULL
checks, no out-of-bounds checks, an endless loop based on pointer
arithmetic..

Good, fast C code does not look like this.

~~~
kd0amg
_no NULL checks, no out of bounds checks_

I would expect formal verification to make many instances of these
superfluous. If this code genuinely needs to have these added, that sounds
like a flaw in their verification system.

~~~
waffenklang
I agree that it's the task of the fv-system to take care of this, and so
it's a flaw in the system if these checks are necessary.

IMHO, in this case it is a flaw: the 'public' API function base64_encode
wants the user to pass pointers to two buffers but takes a length for only
one of them, and the code makes implicit assumptions about the user's input
without checking it (encoded is manipulated).

~~~
randomvisitor
It's a flaw in the specification of the theorem, not in the verification.

The way it works is that they choose fixed sizes INLEN and OUTLEN, and prove
that when data points to a buffer of INLEN bytes, they can allocate a result
buffer and base64_encode(data, INLEN, result) will compute the correct
encoding.

So, implied in this statement is the fact that the input buffer is valid (thus
not null), and that the output buffer must be valid (and not null) and
disjoint from the input buffer.

There are more fundamental classes of bugs that this won't detect:

- behavior for arbitrary sizes, such as integer overflows for large sizes

- bugs when using particular functions of the API, such as
base64_encode_update()

The latter problem means that there is no proof that b64enc.c is correct at
all, even for a fixed size: it's entirely possible that multiple calls to
base64_encode_update() would corrupt the internal state.

~~~
cbsmith
> - behavior for arbitrary sizes, such as integer overflows for large sizes

Actually, I don't see how they could have a bug of this nature unless the
caller actually passed in invalid data. In the context in question, they have
64-bit pointers, and they are using unsigned 64-bit values (which conveniently
have defined overflow semantics) for the lengths of the input buffer.

Even if the calculation for data_end overflows, that would mean one of two
things: the caller improperly specified the buffer, or the memory model really
does have a buffer whose addressing overflows. In the latter case, because
they use an equality test and they have no check for magical address values
(like say, NULL) of currbyte, I don't see how that'd be a bug in the
implementation.

> - bugs when using particular functions of the API, such as
> base64_encode_update()

You mean bugs when using specific parts of the API in ways that are not
consistent with their specification. It's hard to see why you'd even consider
that a bug in the implementation. That's a bug in the caller.

> The latter problem means that there is no proof that b64enc.c is correct at
> all, even for a fixed size: it's entirely possible that multiple calls to
> base64_encode_update() would corrupt the internal state.

Are you sure about that? First, there doesn't appear to be much that would
qualify as "internal state" to base64_encode_update(). The closest thing to
internal state it has is base64_encodestate, which really only captures which
of three states the state machine is in and a result code, both of which are
provably in the correct state at the end of each invocation of the function.

I'm not clear as to why you think the proof doesn't cover
base64_encode_update() being invoked multiple times. It's hard to know without
understanding what their simulator does, but it sure looks and sounds like it
is expanding the full set of possible execution paths through the b64enc,
which would include up to N invocations of base64_encode_update(). The only
way I can imagine it wouldn't be proving correctness for multiple
base64_encode_update() calls is if the simulator always had fread() return
all requested bytes AND you never proved it for cases where N > BUFSIZE
(which would just be silly).

~~~
randomvisitor
> they are using unsigned 64-bit values (which conveniently have defined
> overflow semantics) for the lengths of the input buffer.

Yes, if you examine the code you can manually prove that it's safe: but that's
not proven by their formal computer-checked proof.

As for the rest of your post: you seem to be assuming that what they proved is
that b64enc.c behaves correctly. This is not true at all: they just proved
that the helper function base64_encode() works correctly for a specific size
(this is done in proof/sym_encode.c). This helper function never calls
base64_encode_update() twice and is not used by b64enc.c, and that's why we
don't know if the multiple-block implementation from b64enc.c actually behaves
correctly.

------
rwmj
I particularly like how the C code is licensed "without a warranty":

[https://github.com/davidlazar/base64](https://github.com/davidlazar/base64)

~~~
deletes
Almost every free or open-source license has a "without warranty" clause
somewhere in the agreement.

For example: GNU General Public License:
[http://www.gnu.org/licenses/gpl.html](http://www.gnu.org/licenses/gpl.html)

Mentioned several times.

------
easytiger
Is this formally verified? I'm not sure what's going on, but from a quick
glance it's just comparing the output of two implementations?

~~~
panic
Instead of brute force, it looks like it's comparing automatically-generated
models of the two implementations symbolically.

 _Success! Amazingly, this proof system scales to large values of n where
exhaustive checking is not feasible:_

        $ time make n=1000
        Proving function equivalent to reference:
        encode_aig : [1000][8] -> [1336][8]
        Q.E.D.
        real: 17.882s  user: 16.31s  cpu: 1.50s

~~~
pjscott
Modern SMT solvers are astonishingly fast for most real-world inputs. Not bad,
considering that the problem they solve is NP-complete!

------
mbq
... from a company working for the NSA. (;

