
Changes to Ragel in Response to the CloudFlare Incident - LaFolle
https://www.colm.net/news/2017/02/28/changes-to-ragel-cloudflare.html
======
wyldfire
For those of us who knew about the incident but not the nitty-gritty details
-- "What is Ragel?"

> Ragel compiles executable finite state machines from regular languages.
> Ragel targets C, C++ and ASM. Ragel state machines can not only recognize
> byte sequences as regular expression machines do, but can also execute code
> at arbitrary points in the recognition of a regular language.

> What kind of task is Ragel good for?

> Writing robust protocol implementations.
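
To make that description a bit more concrete, here is a minimal hand-written sketch (not Ragel output, and in Rust rather than one of Ragel's listed targets) of a state machine for the regular language `[a-z]+ '=' [0-9]+` that runs a small action on each transition. The `State` enum and `parse_pair` function are hypothetical names chosen for illustration.

```rust
/// States of a tiny hand-written machine for `[a-z]+ '=' [0-9]+`.
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Start,
    Key,
    Eq,
    Value,
}

/// Runs the machine over `input`. Returns the parsed (key, value) pair if the
/// whole input is in the language, or None otherwise.
fn parse_pair(input: &[u8]) -> Option<(String, u64)> {
    let mut state = State::Start;
    let mut key = String::new();
    let mut value: u64 = 0;

    for &b in input {
        state = match (state, b) {
            // Action on key bytes: append the byte to the key.
            (State::Start, b'a'..=b'z') | (State::Key, b'a'..=b'z') => {
                key.push(b as char);
                State::Key
            }
            (State::Key, b'=') => State::Eq,
            // Action on digit bytes: accumulate the number, rejecting overflow.
            (State::Eq, b'0'..=b'9') | (State::Value, b'0'..=b'9') => {
                value = value.checked_mul(10)?.checked_add(u64::from(b - b'0'))?;
                State::Value
            }
            // No transition defined: the input is not in the language.
            _ => return None,
        };
    }

    // Only `Value` is an accepting state.
    if state == State::Value {
        Some((key, value))
    } else {
        None
    }
}
```

Calling `parse_pair(b"ttl=300")` yields the key `"ttl"` and the value `300`, while inputs such as `b"ttl="` or `b"a=1x"` are rejected.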

~~~
tyingq
Cloudflare's postmortem also goes into some detail on what they did wrong with
Ragel: [http://blog.cloudflare.com/incident-report-on-memory-leak-ca...](http://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/)

~~~
runevault
I believe they originally just stated that Ragel had problems, and only
corrected the post to say the mistake was in their own code after Ragel's
creator reached out to them.

~~~
tedivm
Cloudflare is really failing to inspire confidence here. Between them
downplaying the issue, refusing to work with the Google engineer who found it,
blaming the proxy owners (such as google) for not cleaning up their mess
quickly enough, and attempting to dump the blame on groups like Ragel, I just
have a hard time believing that Cloudflare will ever handle a security
incident properly after this mess.

~~~
ethbro
You mean by being transparent as opposed to some of the other serious security
incidents out there?

Could you please clarify exactly how they "downplayed the issue"? The impact
is an exercise in statistics. They seem to have compiled what logging /
statistics they could, then accurately advised everyone of what they found.

They turned around an initial fix on the issue in hours, no?

As for web caches, what are they supposed to do, other than plead for their
owners to delete them ASAP?

As for Ragel, not sure how the initial version of the response read, but the
one I read clearly pointed out it was a Cloudflare issue. And given the
semantics of the issue, I'd cut them some slack if they misidentified the
cause the day after they'd been scrambling to fix this.

I'd be interested who you _do_ trust to properly handle a security incident if
you're looking for better than this. IBM? Microsoft? Apple? Just try and get a
24-hour fix out of them.

~~~
cbr
At least with "blaming the proxy owners (such as google) for not cleaning up
their mess quickly enough" I think tedivm is probably referring to:
[https://news.ycombinator.com/item?id=13721644](https://news.ycombinator.com/item?id=13721644)

~~~
ethbro
I saw that comment when it hit the original thread. I think commenters are
conflating Project Zero with Google search cache in the usual version of this
sentiment.

If I was Cloudflare, and someone had made me aware of this bug, I'd be griping
as loudly as possible to get remaining caches cleared ASAP. It's all they
really can do.

The fact that Bing still had cached documents that hadn't been located has no
bearing on whether Google search was quick to clear its own caches.

Did Cloudflare cause all this? Yes. Are they working in the interest of their
customers to gripe at remaining cachers to refresh their caches? Also yes. Is
there something else they could do now that we're here? I can't think of
anything more effective.

------
devy
Honest question: would Ragel be more robust if it were written in Rust and
generated Rust code?

~~~
JoachimSchipper
Not really.

Safe code generating unsafe-code-with-bugs gets you a vulnerable program, so
rewriting Ragel in Rust doesn't help.

As to generating Rust - Rust isn't magic, and Rust-that-is-as-fast-as-C can't
do a lot more work than C. In particular, you don't want to do memory
allocation in the fast path of a network server written for performance (note
that Rust uses jemalloc, which is a C memory allocation library; Rust memory
allocation isn't going to beat C memory allocation.)

As such, a realistic Rust program probably needs to re-use network buffers, at
which point language-level security doesn't stop you from sending out "old"
data that was still in the buffer. [EDIT: but see the reply
[https://news.ycombinator.com/item?id=13804831](https://news.ycombinator.com/item?id=13804831)
below.] A debug build in Rust could allocate-per-request and detect the
problem; but allocate-per-request in C is also likely to detect such overflows
(indeed, the CloudFlare server _did_ occasionally crash on this bug), and C
programmers can also use e.g. VALGRIND_MAKE_MEM_UNDEFINED() for effort
comparable to maintaining an alternate allocate-per-request implementation in
Rust.

The difference for the fast path - where Rust can't do a lot more work, and
where you can focus your effort - just isn't very large. (Using a safe
language for e.g. low-performance control-plane interfaces _can_ help, however
- but at that point, you may well be arguing e.g. Rust vs. Lua.)
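
A rough sketch of the stale-data concern, under the assumption that the server keeps one fixed buffer and only tracks a logical length. `ReusedBuf` and its methods are hypothetical names, not taken from any real server. Everything here is safe Rust, yet a wrong length that still falls inside the physical buffer passes the bounds check and exposes bytes from an earlier request.

```rust
/// A reusable network buffer: fixed physical storage plus a logical length.
/// Hypothetical type for illustration only.
struct ReusedBuf {
    storage: [u8; 16],
    len: usize,
}

impl ReusedBuf {
    fn new() -> Self {
        ReusedBuf { storage: [0; 16], len: 0 }
    }

    /// Overwrites the start of the buffer with `data` without clearing the
    /// rest, as a performance-minded server might. Truncates to capacity.
    fn fill(&mut self, data: &[u8]) {
        let n = data.len().min(self.storage.len());
        self.storage[..n].copy_from_slice(&data[..n]);
        self.len = n;
    }

    /// Correct view: only the bytes of the current request.
    fn payload(&self) -> &[u8] {
        &self.storage[..self.len]
    }

    /// Buggy view: a miscalculated length that is still within the physical
    /// buffer. The bounds check passes, so stale bytes are returned.
    /// Panics only if `wrong_len` exceeds the physical capacity.
    fn payload_with_bad_len(&self, wrong_len: usize) -> &[u8] {
        &self.storage[..wrong_len]
    }
}
```

If the buffer is filled with `b"secret-token"` and then with `b"hi"`, `payload()` returns just `hi`, but `payload_with_bad_len(12)` returns `hicret-token`: most of the old request is still there.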

~~~
Manishearth
Rust's version of Ragel actually generates 100% safe Rust code, AFAICT without
any additional cost.

[http://github.com/erickt/stateful](http://github.com/erickt/stateful)

~~~
JoachimSchipper
Are you sure? CloudFlare's problem seems to have been running off the end of a
logical buffer embedded in a linked list of "physical" buffers.

(Sorry, have to run - will try to take a better look later tonight.)

~~~
Manishearth
In Rust you'd usually use a slice for this anyway, which has bounds checks.
Yes, you could use indices instead, but that's not very Rusty (and more
annoying to use!), whereas using pointer arithmetic is par for C.

Edit: Yes, you could ultimately get it wrong in Rust, too, I just think it's
much less likely.
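
As a small illustration of the bounds-check point, here is a sketch of two slice-based helpers (`byte_after_skip` and `take_until` are hypothetical names). If the index arithmetic overshoots the end of the buffer, the caller gets `None` instead of a read from whatever memory follows.

```rust
/// Returns the byte `skip` positions after `pos`, or None if that index
/// overflows or lies past the end of `buf`.
fn byte_after_skip(buf: &[u8], pos: usize, skip: usize) -> Option<u8> {
    // checked_add guards the index arithmetic; `get` guards the access.
    let idx = pos.checked_add(skip)?;
    buf.get(idx).copied()
}

/// Returns the bytes from `start` up to (not including) the first `delim`.
/// Returns None if `start` is past the end or the delimiter never appears.
fn take_until(buf: &[u8], start: usize, delim: u8) -> Option<&[u8]> {
    let rest = buf.get(start..)?;
    let end = rest.iter().position(|&b| b == delim)?;
    Some(&rest[..end])
}
```

With `buf = b"key=value;"`, `take_until(buf, 4, b';')` gives `value`, while `byte_after_skip(buf, 9, 2)` and `take_until(buf, 11, b';')` both give `None`. Plain indexing (`buf[idx]`) would panic in the same situations rather than read out of bounds.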

~~~
JoachimSchipper
So you're thinking of an iterator of slices of char (arrays), right?

It still seems somewhat tricky to get right to me; the writer wants access to
the underlying physical buffers, so you need a conversion step from
list-of-physical-buffers to slices without allocating, etc. But you're
probably not wrong that a Rust programmer is more likely to get this right;
thanks!

(And it's all likelihoods anyway; there's plenty of C code that _doesn't_ get
this wrong, and e.g. the VALGRIND_MAKE_MEM_UNDEFINED() I suggested above
_does_ reliably - albeit dynamically - find this problem in C.)

(Added a reference from my top comment to your reply.)

~~~
Manishearth
You don't need to allocate; Rust slices are safe zero-copy views into an
existing buffer.
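
A minimal sketch of that idea, assuming the "physical" buffers are available as a slice of byte slices (the real layout is a linked list, so this is a simplification, and `logical_slices` is a hypothetical name). The logical buffer is exposed as an iterator of borrowed sub-slices, with no copying or allocation.

```rust
/// Yields the bytes of a logical buffer of `len` bytes that starts `offset`
/// bytes into a chain of physical buffers, as borrowed sub-slices.
fn logical_slices<'a>(
    chain: &'a [&'a [u8]],
    mut offset: usize,
    mut len: usize,
) -> impl Iterator<Item = &'a [u8]> + 'a {
    chain.iter().filter_map(move |&phys| {
        // Skip physical buffers that lie entirely before the logical start.
        if offset >= phys.len() {
            offset -= phys.len();
            return None;
        }
        let start = offset;
        offset = 0;
        // Clamp to what is left of the logical buffer and of this chunk.
        let take = len.min(phys.len() - start);
        len -= take;
        if take == 0 {
            None
        } else {
            Some(&phys[start..start + take])
        }
    })
}
```

For the chain `["hello ", "wor", "ld!!"]`, an offset of 4 and a length of 6 yield the pieces `"o "`, `"wor"`, and `"l"`. If the requested length runs past the end of the chain, the iterator simply stops early; it cannot walk into unrelated memory.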

Yes, you can run under valgrind, but dynamic checks ... :)

You need to have test cases that hit those code paths, and it's really hard to
make "unexpected" testcases that test code paths like these. That's why
fuzzing is awesome!

