Hacker News
Crema: A Sub-Turing Programming Language (ainfosec.github.io)
29 points by omnibrain on Aug 5, 2015 | 30 comments



This looks like a more practically usable version of Douglas Hofstadter's BlooP.

https://en.wikipedia.org/wiki/BlooP_and_FlooP

The inspiration for the claim that this more restricted language is a security benefit comes directly from Sassaman, Patterson, Bratus, and Shubina:

http://langsec.org/papers/Sassaman.pdf

(See "Principle 1". "Computational power is an important and heretofore neglected dimension of the attack surface. Avoid exposing unnecessary computational power to the attacker. An input language should only be as computationally complex as absolutely needed, so that the computational power of the parser necessary for it can be minimized. For example, if recursive data structures are not needed, they should not be specified in the input language.")

Further research on these ideas has continued under the name of the "language-theoretic security research program".

http://www.langsec.org/

This language is surely intended as a contribution to that project. (The behavior of programs in Crema, as in BlooP, is decidable, which should help with correctness verification.)


You can easily put a cap on computational power without dumbing down the language, via external resource limits: how many CPU cycles can be executed (or abstract processor time), how much memory can be allocated, and how deep the stack can go.

Such parameters are easy to administer and easy for IT people to understand.

The limitations in a less-than-Turing language don't nicely translate to a hard cap on the running time. Attackers can find convoluted programs which maximize running time by skirting every available limit.
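As a concrete sketch of the external-limits approach (not anything Crema-specific): on Unix, the kernel can enforce a hard CPU-time cap on a child process via setrlimit, regardless of what language the child runs. The helper name run_capped is made up for illustration; only the Python standard library is used.

```python
# A sketch of "external resource limits" (Unix-only, stdlib-only):
# run untrusted code in a child process with a kernel-enforced CPU cap,
# instead of restricting the language itself.
import resource
import subprocess
import sys

def run_capped(code, cpu_seconds):
    def limit():
        # The kernel signals the process (SIGXCPU, then SIGKILL)
        # once it exceeds this much CPU time.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
    return subprocess.run([sys.executable, "-c", code], preexec_fn=limit)

# An unbounded loop in a Turing-complete language is still
# terminated by the external limit (nonzero/negative return code):
print(run_capped("while True: pass", cpu_seconds=1).returncode)
```

The memory and stack caps mentioned above map to RLIMIT_AS and RLIMIT_STACK in the same API.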


I don't think this is the right sense of "computational complexity" or of the constraints here. Consider:

  ulimit -t 1; nc -l 55555 | xargs -0 python -c
versus

  python -c 'print len([n for n in xrange(10**15) if str(n) == str(n)[::-1]])'
The second one can (and will) use more CPU time than the first one, guaranteed, but the second one is safer and we can have a very concrete understanding of why.

I think in LANGSEC, the crux of the matter is decidability, not time complexity. And those are ultimately distinct issues, even though both are studied in theoretical computer science.


The second one is safer because you control the code; you're not executing code read from a network.

I don't believe that this is the topic here (comparing untrusted versus trusted code).

The issue is: if we already have a properly sandboxed language, such that we can expose it as an input language, we can still be DDoS attacked by inputs that perform "too much computation". They take too long, and/or chew memory. Sandboxing has to take these attacks into account also, not only restricting what the code has access to.

(My point was that external resource limits can achieve this without crippling the expressivity of the language.)

> I think in LANGSEC, the crux of the matter is decidability, not time complexity.

I see that.

From langsec.org: "LANGSEC posits that the only path to trustworthy software that takes untrusted inputs is treating all valid or expected inputs as a formal language, and the respective input-handling routines as a recognizer for that language. The recognition must be feasible, and the recognizer must match the language in required computation power."

Okay, so that implies decidability. The LANGSEC approach seems to rule out inputs which are Turing-complete computational languages, because recognizing whether they are valid would mean running them, which is equivalent to solving the Halting Problem.

Time complexity is secondary to the guarantee that recognition terminates (deciding yes, the input conforms or no it doesn't). That's a matter of tuning the permitted input size versus the recognizer algorithm's asymptotic complexity.
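A small sketch (mine, not from the LANGSEC papers) of what "recognizer matched to the language's power" looks like in practice: a DFA for the regular language (ab)* runs in exactly one step per input symbol, so recognition is guaranteed to terminate and its cost is a simple function of the input size.

```python
# Sketch: a DFA recognizer for the regular language (ab)*.
# It takes exactly len(s) steps: recognition always terminates,
# and its running time is a trivial function of input size.
def accepts(s):
    state = 0                        # 0: expecting 'a', 1: expecting 'b'
    for ch in s:                     # one step per input symbol, no more
        if state == 0 and ch == 'a':
            state = 1
        elif state == 1 and ch == 'b':
            state = 0
        else:
            return False             # reject on any other symbol
    return state == 0                # accept only after complete 'ab' pairs

print(accepts('abab'), accepts('aba'))  # True False
```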

However, in some application domains we must have inputs (or do have them, in any case) which are in fact computational languages (e.g. Javascript inputs to a browser). We cannot banish these because some LANGSEC forum wants every input to be recognized syntactically as a piece of formal syntax, not having any semantics as a piece of code which unfolds only through execution. Still, such inputs should be treated formally as much as possible rather than in an ad hoc way.


Agreed. Regular expressions are not Turing-complete, and they can still make your browser ponder for a bit if carefully crafted. Try

  /(x+x+)+y/.test('xxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
for instance. This regex doesn't even use non-regular extensions; it's expressible in a very limited computational model, it fits in a tweet and yet it renders my browser unresponsive.

I can do even more damage with a nasty shader; GLSL is not Turing-complete either (in fact it looks a lot like Crema).
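Python's re module is also a backtracking engine, so the blow-up from that regex can be measured directly. A small sketch (timings are machine-dependent; the helper name elapsed is made up):

```python
# Sketch: measuring the exponential backtracking of (x+x+)+y.
# With no 'y' in the input, the match must fail, but only after the
# engine has tried every way of splitting the run of x's between
# the two x+ groups -- roughly doubling the work per extra 'x'.
import re
import time

pattern = re.compile(r'(x+x+)+y')

def elapsed(n):
    start = time.perf_counter()
    pattern.match('x' * n)              # guaranteed to fail, slowly
    return time.perf_counter() - start

for n in (12, 16, 20, 22):
    print(n, elapsed(n))                # times grow exponentially in n
```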


Well, there are faster implementations of regexps than the one your browser is most likely using [1].

[1] http://swtch.com/~rsc/regexp/regexp1.html


That's a nice example.


Security is one reason why the Bitcoin 'language' is intentionally restricted. I wonder what kind of integrity guarantees can be proven when executing a transaction in Ethereum (which also uses a blockchain, but is Turing-complete).


Time and space are bounded by gas; that ought to be a sufficient guarantee.


Are there also language-based approaches to defend against timing attacks? Are there languages which zero buffers/variables after use?


There has been some work on language-based mitigation of timing channels, e.g. http://www.cs.cornell.edu/andru/papers/pltiming.html and http://www.cse.chalmers.se/~russo/publications_files/ESORICS... .


I don't think those topics have been addressed in LANGSEC (although it might be easier to examine all the possible timing properties of code in a non-Turing-complete environment).

An approach to the second topic is described in

https://www.usenix.org/legacy/event/sec05/tech/full_papers/c...


Thanks! The paper is interesting, but not language-based. They change the runtime (more precisely, they go libc-free) and the kernel.

I was thinking about something different. In C++ you could create a wrapper type (similar to smart pointers) which zeros on deallocation. However, there is the problem that the compiler could optimize the zeroing away, because nobody reads those writes. So you really need language/compiler support to do this correctly and in a future-proof way.


There's no clear value proposition given. What are we sacrificing Turing-completeness in order to gain? For example, one thing a programming language might gain by sacrificing Turing-completeness is that every program provably halts; no infinite loops! This is called "total functional programming." There are probably other interesting properties you can gain by sacrificing Turing-completeness as well.

> Crema can restrict the computational complexity of the program to the minimum needed to improve security

"Computational complexity" is a technical term that is related to performance, but not security.


> There are probably other interesting properties you can gain by sacrificing Turing-completeness as well.

Not the author, but I wrote another sub-Turing language. In my opinion, there are two useful properties: a) type inference -- you can infer the type of the whole program; b) you can infer memory use and thus avoid the need for garbage collection or manual memory management. Both properties are very useful for performance, of course. :)


Hi, I'm writing a total functional language as well. I think that garbage collection is still useful, however, as it can collect garbage as the program runs, not just at the end.


I've managed to make it so my memory use is completely deterministic -- no garbage at all.


That sounds interesting. Any references or notes on that I can read? (or source code :) )

I would say that just because your memory use is deterministic doesn't mean you don't get garbage. Garbage is just memory that isn't needed any more.

For example, in

  let
    a = f(x)
    b = g(y)
  in
   a + b
If there is working memory used in the body of f(), it will be garbage when f() has finished executing, because it won't be referred to any more. So it could be freed after f() is executed and before g() is executed, which may reduce the maximum amount of memory required by the program.


From the Crema wiki that I just glanced at, it is indeed not possible to write infinite loops:

"The only type of loop supported by Crema is the foreach loop. This loop is used to sequentially iterate over each element of a list. Crema specifically does not support other common loops (such as while or for loops) so that Crema programs are forced to operate in sub-Turing Complete space. In other words, foreach loops always have a defined upper bound on the number of iterations possible during the loop's execution. This makes it impossible to write a loop that could execute for an arbitrary length of time."


> This is called "total functional programming."

Thanks for bringing it up, I was not aware of this: https://en.wikipedia.org/wiki/Total_functional_programming

Recently I had a need to design a small language for a constrained environment, and I spent a bit of time worrying about the termination problem. Now I know where to look for more research.


Well, from my understanding, I would say "security" is a very broad term, influenced by a broad range of factors including "computational complexity", don't you think?


If all you're doing is sacrificing the possibility of infinite loops, just hack a C++ runtime that kills the process after 24 hours, and blammo: non-Turing-complete.


Cool!

Another sub-Turing language is my own 'tab' (https://bitbucket.org/tkatchev/tab), a text processing language/utility.

'Tab' was born out of a practical need to process text files in a manner similar to SQL statements, so the focus is different from research languages.

Hope the trend catches on, sub-Turing languages are cool.


Instead of Crema's

    int i = 0
    int values[] = [6, 3, 8, 7, 2, 1, 4, 9, 0, 5]
    foreach (values as dummy) {
      int_print(values[i])
      i = i + 1
    }
it would be more straightforward to define loops over array indexing ranges, as in

    foreach (int i indexing values) {
      int_print(values[i])
    }


It would seem that you could implement that abstraction within the language

    def int int_indexing(int values[])[] {
        int i = 0
        int result[] = list_create(list_length(values))
        foreach (values as dummy) {
            result[i] = i
            i = i + 1
        }
        return result
    }
I haven't tested this, and I had to look in the source code to find the list_create function from the standard library. Of course, since the language lacks type parameterization, you'd need a different indexing function for each type of array.


Do they forbid recursion? I don't see anything about it on the wiki, but I assume they must.


crema/tests/fail/ includes several tests relating to recursion, including the obvious mutual recursion workaround attempt. Given that they're in the fail subdirectory, I conclude that it's testing that they're rejected.


Presumably you can only call functions defined earlier.


Not necessarily; you can bound recursion.


As I understand it, Crema's fixed-length loops give it the computational power of primitive recursive functions, which is enough for most programming applications. However, that assumes that a program is something which runs, computes a result, and is done. A lot of programs fall outside that scope, even if they do not require more computational complexity per se.
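A sketch of that primitive-recursive discipline in Python (mine, not Crema code): every loop below iterates over a range whose bound is fixed before the loop starts, like Crema's foreach, so termination is guaranteed, yet the usual arithmetic tower can still be built up.

```python
# Sketch: building arithmetic with only bounded loops, in the style
# of primitive recursion. Each loop's iteration count is known before
# the loop begins, so every function provably terminates.
def add(a, b):
    for _ in range(b):        # bound fixed on entry
        a += 1
    return a

def mul(a, b):
    acc = 0
    for _ in range(b):        # b bounded additions
        acc = add(acc, a)
    return acc

def power(a, b):
    acc = 1
    for _ in range(b):        # b bounded multiplications
        acc = mul(acc, a)
    return acc

print(power(2, 10))  # 1024
```

Functions like Ackermann's, which outgrow every primitive recursive function, cannot be written this way; that is exactly the expressivity being traded for guaranteed termination.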

A daemon or an operating system is a typical example of a program without a bounded computation time. Not because of its computational complexity, but because it continuously (or on demand) produces new results. The interesting property for such programs is not termination, but productivity: they must keep being productive, and not become unresponsive. This is a much more subtle notion than termination, but I think the programming language Agda has come some way with its co-recursive data types.



