While security is – of course – a noble cause, this approach has many uses beyond that.
E.g., I just upgraded an old 6502 assembler to support complex value expressions, including variables. Now, the 6502 has two distinct address modes for accessing memory (r/w), a fast single-byte mode dedicated to so-called zero-page addresses (like 0x0020), and a normal one for word-size address operands. (The "zero-page" may be regarded as a bank of 256 auxiliary registers.)
A good assembler should determine the address mode automatically, and, at the same time, should provide means to express the wish for a value to be treated as a word-size address. A common convention is to assume numbers expressed like "0x0020" to be of word-size, while "0x20" would be of byte-size. Using complex expressions, we want to propagate this kind of "tainting" to any derived values, as well. It's basically the same problem, in this case propagating from the parsing level to any evaluations while using just a binary attribute. At the same time, as we have to determine the size of any instructions in a first pass, in order to have known addresses ready for the second pass, which actually generates the object code, it's somewhat similar to the problem of assuring correctness at compile time.
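A minimal sketch of that propagation in Haskell (my own toy names, not any particular assembler's API): the width flag rides along with the value, and any operation involving a word-sized operand yields a word-sized result:

data Width = Byte | Word deriving (Eq, Ord, Show)

data Val = Val { width :: Width, num :: Int } deriving Show

-- Any operation involving a Word operand produces a Word result.
combine :: (Int -> Int -> Int) -> Val -> Val -> Val
combine op (Val w1 a) (Val w2 b) = Val (max w1 w2) (op a b)

-- "0x0020" parses as Word, "0x20" as Byte, per the convention above.
base, offset :: Val
base   = Val Word 0x0020
offset = Val Byte 0x05

addrMode :: Val -> String
addrMode v = if width v == Word then "absolute" else "zero-page"
-- addrMode (combine (+) base offset) == "absolute"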
There are lots of real-world problems involving any kind of parsing where an approach like this is helpful and useful.
> Perhaps the most famous example is figuring out passwords by measuring the time it takes for a program to reject candidate passwords. The longer rejection takes, the longer password prefix can be deduced
That's odd. Wouldn't a password check be performed on the hashed version of the plaintext password? Words that share prefixes shouldn't produce hashes with the same prefix. There shouldn't be any difference in the time it takes whether the prefix is wrong or correct.
I was really hoping this post would include a demonstration of type-level safety against such attacks given the mention in the intro. It's not obvious to me how the demo would be extended to cover such timing leaks. I suspect that even with type-level safety, you're still at the mercy of the optimizer and language runtime to maintain those security properties. Can you specify constant time codegen or other constraints with GHC?
Sorry, I didn't mean to mislead. I was exemplifying ways of subtly leaking data. In the next paragraph, I clarify what we cover with the following:
> Namely, we implement a type system that regulates explicit and implicit dataflows as well as non-termination as a covert channel.
Yes, there is work on preventing timing attacks using static types. One line of work is "A Hardware Design Language for Timing-Sensitive Information-Flow Security", which addresses exactly this problem. The second line of work is resource analysis: there are type systems that can specify the complexity of a program. Check out, for example, relational cost analysis [0]. This can be used for privacy and security purposes.
Needless to say, this is considerably more sophisticated than what I covered.
That codahale article you linked is surprising. Is that just a toy example or is it actually viable? I wouldn't think a "cryptographic" hash of one input sharing any prefix with another input would leak any information about the plaintext due to the avalanche effect. I thought that was possible for encrypted values but not hashed values.
chrismorgan's comment about plaintext only is what I would've assumed the attack was limited to for passwords.
It's fundamentally the same attack as the plaintext case, although here it's breaking signature verification.
Here, the computation is
real_sig := Sign(msg)
Cmp(real_sig, provided_sig)
For a fixed message, Sign(msg) is constant time. So, with enough attempts you can create a high-resolution timer to tease apart subtle timing differences in Cmp.
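As a sketch of the kind of comparison that leaks (hypothetical code, not taken from the article): a byte-wise check that bails out at the first mismatch runs longer the more leading bytes agree, and that difference is exactly what the repeated attempts tease out.

-- Hypothetical early-exit comparison: running time grows with the length of
-- the matching prefix, which is the signal the attacker measures.
leakyCmp :: [Int] -> [Int] -> Bool
leakyCmp (x:xs) (y:ys)
  | x /= y    = False          -- bail out on the first mismatch
  | otherwise = leakyCmp xs ys
leakyCmp []     []     = True
leakyCmp _      _      = False -- length mismatch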
(But this relies upon transmitting the plaintext password, you might say! If you transmit the hash directly, then you're again hosed. One option I've seen previously is transmitting a hash of the password, then hashing + salting it once more, which relies on avalanche effect as you suggest. I provide no guarantees re: the security of such an approach, though -- talk to an actual cryptographer!)
> One option I've seen previously is transmitting a hash of the password, then hashing + salting it once more, which relies on avalanche effect as you suggest.
If the client-side hash is fixed, this is no more secure than transmitting the plaintext password, since as far as the server is concerned, it is the password. You could, however, send a nonce to the client, send back H(nonce++password), calculate H(random++Hsent), and compare that to H(random++H(nonce++password)) calculated on the server.
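Roughly, as a sketch (with `hash` left abstract as a placeholder; this only illustrates the blinding idea and is not vetted cryptography):

-- hSent     = H(nonce ++ password), computed by the client
-- hExpected = the same value, computed on the server side
-- The fresh random prefix blinds the comparison, so an attacker cannot line
-- up a chosen prefix against the stored value.
verify :: ([Int] -> [Int]) -> [Int] -> [Int] -> [Int] -> Bool
verify hash randomPrefix hSent hExpected =
  hash (randomPrefix ++ hSent) == hash (randomPrefix ++ hExpected)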
> talk to an actual cryptographer
I am not your lawyer^Wcryptographer and this is not legal^Wcryptographic advice.
Yes. You could (probably should) add an additional password = PBKDF(real_password) if that's a problem (e.g. because of password reuse), but it's not an essential feature from a cryptographic perspective.
I brought it up mostly because I was hoping someone would point out how typing can easily fix such timing attacks, but yeah that was a legitimate issue. [1] shows a walkthrough of one attack.
In the particular case of password verification, I don't think it'd be that bad. It simply gives attackers more information than they should have. The issues are obvious if the hash doesn't have perfect diffusion. It also allows you to begin reconstructing the hash and performing offline attacks e.g. to avoid (already too-lax) rate limits.
One thing that bothered me from an initial skim of that medium post:
> Even if the difference is only 0.0011 nanoseconds, the test is reproducible and each gives similar results.
From the graphs, this should read 0.0011 _milliseconds_. 0.0011 ns is eons faster than any CPU clocks out there (roughly 900 GHz), and would take a heck of a lot more samples :) Meanwhile, 0.0011 ms is on the order of a microsecond, which is very doable.
When comparing hashes, you can still have timing differences. You have to actually compare the whole string, drag the comparison result along and evaluate it at the very end. And make sure the compiler is not "helping". But this would give no knowledge of how long the matching prefix is. The hash should indeed mask that.
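Concretely, something like this (a sketch; and as said, you still have to check what the compiler actually does with it):

import Data.Bits (xor, (.|.))

-- Accumulate the XOR of every byte pair and only inspect the result at the
-- very end, so the running time doesn't depend on where a mismatch occurs.
constTimeEq :: [Int] -> [Int] -> Bool
constTimeEq xs ys =
  length xs == length ys &&
  foldl (\acc (x, y) -> acc .|. (x `xor` y)) 0 (zip xs ys) == 0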
I once encountered a situation where a junior programmer just wrote the password to the log file.
The point of using a separate type for secure data is to allow the compiler to try to trap situations where sensitive data is used incorrectly. (Yes, a dedicated moron could probably work around this kind of trap, but there is no such thing as a 100% perfect safety net.)
Trying to prevent data leaks at the hardware level is a completely different problem with a completely different solution.
That's a risk that should be independently assessed. This could mitigate accidental design decisions that would emit data that's not intended to be revealed. It would force designers to explicitly cast data as having been transformed to be safely emitted.
Wouldn't that require the programmer to be adversarial?
Since all type systems have escape hatches, I don't think they would be resilient at all in adversarial settings. This is more geared towards detecting accidental and _subtle_ data leakage.
Every program can already read its entire address space. What you're probably trying to refer to is the case where a subroutine is hosted within a virtual machine: given a side channel, the subroutine can read the address space of the VM host.
This is an extremely niche scenario; basically, it's just browsers. Most people don't accept arbitrary programs executing in their address space.
Memory-safe languages are the basis for language VMs' isolation of programs. Programs in those languages generally cannot read the entire address space, unless they have unsafe hooks.
Micro-architectural side-channels bypass language security mechanisms altogether.
You're just saying what I just said back to me, except with the confusing assertion about memory-safe languages being the basis for language VMs' isolation. I'm clearly missing something.
edit:
> Programs in those languages generally cannot read the entire address space, unless they have unsafe hooks.
Also, this isn't true? Unless we're splitting hairs about semantics (i.e., "read that exact address that isn't allocated yet"). Lots of memory-safe languages can allocate and compute arbitrarily, and there is no isolation of what subroutines can or can't read the memory of others.
Anyway IDK what points you're trying to make, this all seems very specific to VMs, and this post is tackling other problems.
Maybe it depends on your definition of VM? I think, given the way you phrased it, I read it as VM in the sense of host/guest (e.g. QEMU, VMware, etc.).
Spectre and Meltdown were broader. So in this case, if you loaded a script into your own process and executed it in its own sandbox (e.g. Lua, JavaScript, Python, etc.), that script could now access the containing process' memory space, whereas without Spectre and Meltdown it couldn't. Those attacks are still relevant, and browser vendors are trying to be very careful here not to reintroduce gadgets that could be used to resurrect the attack. It's also not just browsers. Notepad++ had to patch some things in its plugin system to protect against Spectre/Meltdown, if I recall correctly.
In fact, the original Spectre and Meltdown vulnerabilities impacted the kernel on the same machine, so process A could read process B's memory, although to my knowledge that's been largely mitigated. You do have the Rowhammer attacks, and those have not been mitigated, so all these attack vectors are still a concern.
> In fact, the original spectre and meltdown vulnerabilities impacted the kernel on the same machine so process A could read process B's memory
Not quite; process A could read kernel memory, but reading process B's memory requires a further security problem[0] such as the kernel having the entire physical memory space identity-mapped (for Meltdown, or for Spectre with gadgets) or the kernel running ('near' enough to speculate) Spectre-exploitable gadgets when context-switched to process B (for Spectre).
0: It arguably wasn't a vulnerability until Meltdown and Spectre entered the threat model, but it evidently should have been, since they in fact did.
I know we are off-topic, I was replying to your comment to clarify some misconceptions.
No, memory-safe languages do not let programs read the entire address space. Memory-safe languages include static (and dynamic) type safety measures so that a datum of type A cannot be treated as a datum of type B (if they are incompatible types). That means if different subroutines work on different types, they cannot read each others' data. That also implies that the heap has separable structure; your routine cannot get to my objects unless it has a type-safe access path to them. It cannot hunt through memory to find my crypto keys, e.g.
Therefore, security measures can be designed into programs that use the type system. The OP is an example of one.
> You're just saying what I just said back to me
Well, no.
> with the confusing assertion about memory safe languages being the basis for language VMs isolation. I'm clearly missing something.
Sorry if that confused you, it might have been cryptic. If you read what I've written in this comment it will be more clear now. Language VMs separate programs from each other because they can enforce a type system. Separate types in separate programs means they cannot interfere.
This kind of typing is quite cool, but I didn't find it so well explained.
The general pattern is to type programs not only with a type, but also with some "quantity", where that quantity puts constraints on what is inside. Using quantities like "zero", "one" and "lots", you get linear or affine typing; using quantities "public" & "private", you get the kind of data-leakage typing presented here, but you could just as easily do classification levels (secret, top secret, etc.). I'm bluffing, but I believe this can also model Rust-like lifetimes.

What you want is for your quantities to form a partially ordered semiring (or something slightly weaker): the sum of quantities (e.g. a + b) is used to compute the quantity of a conjunction, e.g. a tuple; the multiplication of quantities is used to compute composition, e.g. for functions: if a function takes an argument of quantity one to a result of quantity q, then for an argument of quantity a it will return a result of quantity a * q. And obviously the ordering relation is used for subtyping, e.g. "private" <= "public".
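For the data-leakage instance specifically, a tiny sketch (my own names): the quantity of anything derived is the join of the quantities of its inputs, so whatever touches private data stays private.

data Level = Public | Private deriving (Eq, Ord, Show)

joinL :: Level -> Level -> Level
joinL = max                        -- Public `joinL` Private == Private

-- Level of a pair is the join of its components' levels.
levelOfPair :: (Level, a) -> (Level, b) -> Level
levelOfPair (l1, _) (l2, _) = joinL l1 l2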
> This is safe because whoever has access to the more private data would also have access to the public one.
I think in real world scenarios, this would not be true of most stakeholder requirements.
That's what actually confused me here, because I don't see the use case in having levels of privacy.
I can see value in a boolean private/not private, for security, where private data you'd expect to always be access-logged, stored securely (encrypted), transferred over secure channels, etc. And you wouldn't ever want this data to accidentally be leaked and end up stored/exchanged in a non-secure place (like unencrypted app logs).
But for privacy, I feel that you'd want to tie the data to some permission control of some sort, and levels don't make sense for it. You'd want something like "user making request", and I don't know if you can embed that in types and check things statically, since it very much seems like a runtime thing.
I used natural numbers as a simplification. In full generality, you can use a lattice where min and max are replaced by meet and join operations. Then you can have categories that are not comparable to each other.
As for privacy, you definitely have security levels. For example, think of an employee record system in a simple company. The employee, payroll, human resources, and general public (perhaps accessing the system through the public-facing portion of their website) will all have different levels of access to information. You probably have the following lattice of access.
              employee
              /      \
             /        \
      payroll          human resources
             \        /
              \      /
               public
These security levels would make sense because public would have the least access (public bio perhaps), payroll would need access to things like your salary, human resources would need access to a record of disciplinary action, and you'd have access to all the information as you probably inputted most of it. This is one of many possible structures a company might have. I'd argue access to employee information is definitely a matter of privacy.
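In code, that diamond could look something like this (a sketch with my own names):

data Access = Public | Payroll | HR | Employee deriving (Eq, Show)

leq :: Access -> Access -> Bool
leq Public _        = True
leq _      Employee = True
leq a      b        = a == b         -- Payroll and HR are incomparable

join :: Access -> Access -> Access   -- least level that can see both
join a b
  | a `leq` b = b
  | b `leq` a = a
  | otherwise = Employee             -- Payroll `join` HR

meet :: Access -> Access -> Access   -- greatest level both can see
meet a b
  | a `leq` b = a
  | b `leq` a = b
  | otherwise = Public               -- Payroll `meet` HR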
That said, you're right that there is a dynamic side to this discussion. What happens if access depends on my geographic location? That information will be stored in a database. This information is dynamic, but you can still use static checking. It'd look something like the following:
location := getLocation(userID);
match location with
| UK -> ...
| US -> ...
Now, you don't know what location you have, but once you pattern match on it, you can check `...`s according to the correct level statically because you know that you can never execute that code if the location was elsewhere. So we use dynamic checks to reveal static information.
The big type-theoretic idea here is "existential types", which pack data and do not reveal information until it is unpacked.
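A rough sketch of that in Haskell (my own toy encoding, not the post's): pattern matching on the constructor tells the typechecker which case we are in, so case-specific code is checked under that assumption.

{-# LANGUAGE GADTs #-}

data UK
data US

data Location r where
  InUK :: Location UK
  InUS :: Location US

-- Existential wrapper: the region is unknown until we unpack it.
data SomeLocation where
  SomeLocation :: Location r -> SomeLocation

ukOnlyQuery :: Location UK -> String
ukOnlyQuery InUK = "checked under UK rules"

handle :: SomeLocation -> String
handle (SomeLocation loc) =
  case loc of
    InUK -> ukOnlyQuery loc   -- typechecks: here loc :: Location UK
    InUS -> "US-specific handling"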
Hope that was helpful. That was a great great question.
Thanks for taking the time to answer. If you can have something more akin to permission categories, that makes it more interesting to me.
By dynamic, in your example, I mean that, for example, as an employee I shouldn't have access to other employees' data, only my own. I don't see how that can be handled statically.
Also, in the code for your lattice example, I guess you'd have a different page for each one? So, say, one page for human-resources, and the variables in the code for that page would be of security level human-resources. In that way I get it: you can't accidentally add payroll-level information on, for example, the public profile page, because assigning from payroll to public would error. But how do you then control which pages the user has access to? Maybe for this you still need a dynamic permission system on top? And the compile-time checks only prevent the programmer from accidentally showing the wrong data on the wrong page?
Lastly, I still don't know if I buy into "levels". From my experience in enterprise systems, requirements always end up such that there's no real hierarchy, just various overlapping sets. For example, it's very likely that human resources would need to see data which the employee shouldn't see, like performance reviews, HR violations about the employee, compensation targets, etc. Similarly, I can imagine payroll needing to see some company-related financials related to the employee's pay which the employee shouldn't have access to either.
That's where I feel levels wouldn't meet requirements, since you rarely have scenarios where one category is a strict superset of another.
I can even imagine this causing leaks, as a programmer might get confused and think: "employee is the highest level, so I guess all those fields I added to the HR page should also go on employee and the assignment succeeds since employee is highest level of privacy, perfect I can leak no data". And now you've leaked private internal company data to the employee.
So here I think what would be cool is if I could instead say that some function returns a value of security [payroll, employee] and another function returns a value of [payroll], while yet another returns [employee]. And now a [payroll, employee] can be assigned to a [payroll] or an [employee] or a [payroll, employee], but you can't assign a [payroll] to a [payroll, employee].
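Roughly what I'm imagining (a sketch; the names are just illustrative): the label is the set of audiences allowed to see a value, and a flow is allowed only if it never widens that audience.

import qualified Data.Set as Set

type Label = Set.Set String

-- source label -> destination label -> allowed?
canFlow :: Label -> Label -> Bool
canFlow src dst = dst `Set.isSubsetOf` src

-- canFlow (Set.fromList ["payroll", "employee"]) (Set.fromList ["payroll"]) == True
-- canFlow (Set.fromList ["payroll"]) (Set.fromList ["payroll", "employee"]) == False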
> I don't see the use case in having levels of privacy.
The GDPR for example has different requirements for "special categories of personal data" (see Article 9 of the GDPR). If you handle that kind of personal data as well as other personal data, you may find it useful and/or necessary to distinguish between the two.
I kind of like these articles; it seems like there might be some helpful message hidden under the arcane incantations, but I always bounce because they require me to learn a whole new dialect.
The takeaway from this story is that we can enforce privacy just as we enforce more traditional types, at no runtime cost. It scales because it is compositional, and the compiler can guide us through programs that won't leak data.
A more niche outcome is that Haskell's traditional type system is rich enough to encode this privacy enforcing type system.
I think if you just ignore the syntax, you can still gain a good intuition about how data might leak in subtle ways and what is sufficient to prevent it.
The language is just assignment to simple variables (like `=` in C), sequencing (like `;` in C), if-then-else statements, and while loops.
The simple version uses algebraic data types (present in Rust, Haskell, OCaml to name a few) to describe the abstract syntax tree of the language.
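Something along these lines (a sketch, not the post's exact definitions):

type Var = String

data Expr
  = Lit Int
  | Ref Var
  | Add Expr Expr
  deriving Show

data Cmd
  = Assign Var Expr      -- x := e
  | Seq Cmd Cmd          -- c1 ; c2
  | ITE Expr Cmd Cmd     -- if e then c1 else c2
  | While Expr Cmd       -- while e do c
  deriving Show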
The inference rules [0] (the bits that have lots of dashes in them) are standard notation for describing logic and type systems.
The rest of the post uses more advanced types, particularly generalised algebraic data types [1] which allow you to have interesting types that reveal information to the typechecker when you pattern match on them.
If you really want to get into type systems and programming language theory, Types and Programming Languages [2] is a classic.
If you just want to understand the privacy ideas, I describe what the rules are supposed to do in detail, so you can gloss over the actual syntax. The only thing that part adds is that I can make Haskell's typechecker prevent construction of privacy violating programs.
If anything is still unclear, feel free to reach out. I'm happy to explain and elaborate.
The goal is the same, but the approach is different. This is entirely compile time and covers all executions of the program, whereas taint analysis is done at runtime and considers only some executions.
Also, taint analysis typically doesn't track the implicit flows described in the post; it only detects the explicit ones.
Let's say we have `a, b, d : 0` and `c : 999`. Then, according to your rule, `a := b :>> c := d` is well-typed and has security level `999`.
Now let's say I have a conditional expression `private : 999`.
The following program typechecks according to the rest of the restrictions. In particular, `999` from `private` is less than or equal to the `999` from the sequenced command.
ITE private $
a := b :>>
c := d
But we just leaked data because if the value of `a` (which we can observe) is not `b`, then we know that `private` was 0 (aka. false). This should never happen because `a` is public and `private` is private.
The intuition here is that the lower the security level a command has, the more public the conditional has to be. Otherwise, private data would *influence* public data, which we can then observe to deduce something about the private data.
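One way to spell out that intuition (a rough sketch only, with made-up helpers, and ignoring the explicit-flow check on assignments): a command's level is the most public variable it might write, and a guarded command is accepted only if the guard's level is at most that.

import qualified Data.Map as Map

type Var = String
type Level = Int                    -- smaller = more public

data Cmd
  = Assign Var Var                  -- x := y
  | Seq Cmd Cmd                     -- c1 :>> c2
  | ITE Var Cmd Cmd                 -- if guard then c1 else c2

writeLevel :: Map.Map Var Level -> Cmd -> Level
writeLevel env (Assign x _)  = Map.findWithDefault 0 x env
writeLevel env (Seq c1 c2)   = min (writeLevel env c1) (writeLevel env c2)
writeLevel env (ITE _ c1 c2) = min (writeLevel env c1) (writeLevel env c2)

wellGuarded :: Map.Map Var Level -> Cmd -> Bool
wellGuarded env (ITE g c1 c2) =
  Map.findWithDefault 0 g env <= min (writeLevel env c1) (writeLevel env c2)
  && wellGuarded env c1 && wellGuarded env c2
wellGuarded env (Seq c1 c2)   = wellGuarded env c1 && wellGuarded env c2
wellGuarded _   (Assign _ _)  = True
-- With a, b, d at level 0 and c, private at 999, the example above is
-- rejected: 999 <= min 0 999 is False.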
> it seems like you could leak info by just assigning two variables, where one is always public.
I didn't quite follow this bit. But if you elaborate, I'm happy to discuss.
I think maybe the key idea I was missing here is that the type of a variable is not always the same as the type of the scope it was defined in? If I'm understanding correctly, there's an implicit rule that's missing:
a : la    b : lb    a := b
--------------------------
          a : lb
Wasn't all the data being exfiltrated by the SolarWinds hack encoded into the copious amount of (seemingly-)random garbage emitted by .NET Framework? Pretty sure C# is typed.
The title is a bit clickbait, as the article isn't saying that statically typed languages don't leak data. Instead, it shows a hypothetical language with a special type checker, where types also encode privacy levels and the type checker asserts at compile time that no variable of a higher privacy type is assigned to a variable of a lower privacy type (amongst other things).
The article is really talking about research advancement in information flow control checks at compile time using advanced type systems, and no real language.
That is a different notion of typing. The type systems you are referring to (which is how they are commonly understood) classify data according to the operations they support. In this post, types classify privacy. These two notions are orthogonal and can be combined if needed. There are many many more notions of typing out there achieving wildly different goals.