While security is – of course – a noble cause, this approach has many uses beyond that.
E.g., I just upgraded an old 6502 assembler to support complex value expressions, including variables. Now, the 6502 has two distinct address modes for accessing memory (r/w), a fast single-byte mode dedicated to so-called zero-page addresses (like 0x0020), and a normal one for word-size address operands. (The "zero-page" may be regarded as a bank of 256 auxiliary registers.)
A good assembler should determine the address mode automatically, and, at the same time, should provide means to express the wish for a value to be treated as a word-size address. A common convention is to assume numbers expressed like "0x0020" to be of word-size, while "0x20" would be of byte-size. Using complex expressions, we want to propagate this kind of "tainting" to any derived values, as well. It's basically the same problem, in this case propagating from the parsing level to any evaluations while using just a binary attribute. At the same time, as we have to determine the size of any instructions in a first pass, in order to have known addresses ready for the second pass, which actually generates the object code, it's somewhat similar to the problem of assuring correctness at compile time.
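A minimal sketch of that propagation in Haskell (my own toy names, not any particular assembler's API): the width flag rides along with the value, and any operation involving a word-sized operand yields a word-sized result:

data Width = Byte | Word deriving (Eq, Ord, Show)

data Val = Val { width :: Width, num :: Int } deriving Show

-- Any operation involving a Word operand produces a Word result.
combine :: (Int -> Int -> Int) -> Val -> Val -> Val
combine op (Val w1 a) (Val w2 b) = Val (max w1 w2) (op a b)

-- "0x0020" parses as Word, "0x20" as Byte, per the convention above.
base, offset :: Val
base   = Val Word 0x0020
offset = Val Byte 0x05

addrMode :: Val -> String
addrMode v = if width v == Word then "absolute" else "zero-page"
-- addrMode (combine (+) base offset) == "absolute"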
There are lots of real-world problems involving any kind of parsing where an approach like this is helpful and useful.
> Perhaps the most famous example is figuring out passwords by measuring the time it takes for a program to reject candidate passwords. The longer rejection takes, the longer password prefix can be deduced
That's odd. Wouldn't a password check be performed on the hashed version of the plaintext password? Words that share prefixes shouldn't produce hashes with the same prefix. There shouldn't be any difference in the time it takes whether the prefix is wrong or correct.
I was really hoping this post would include a demonstration of type-level safety against such attacks given the mention in the intro. It's not obvious to me how the demo would be extended to cover such timing leaks. I suspect that even with type-level safety, you're still at the mercy of the optimizer and language runtime to maintain those security properties. Can you specify constant time codegen or other constraints with GHC?
Sorry, I didn't mean to mislead. I was exemplifying ways of subtly leaking data. In the next paragraph, I clarify what we cover with the following:
> Namely, we implement a type system that regulates explicit and implicit dataflows as well as non-termination as a covert channel.
Yes, there is work on preventing timing attacks using static types. One line of work is "A Hardware Design Language for Timing-Sensitive Information-Flow Security", which addresses exactly this problem. The second line of work is resource analysis: there are type systems that can specify the complexity of a program. Check out, for example, relational cost analysis [0]. This can be used for privacy and security purposes.
Needless to say, this is considerably more sophisticated than what I covered.
That codahale article you linked is surprising. Is that just a toy example or is it actually viable? I wouldn't think a "cryptographic" hash of one input sharing any prefix with another input would leak any information about the plaintext due to the avalanche effect. I thought that was possible for encrypted values but not hashed values.
chrismorgan's comment about plaintext only is what I would've assumed the attack was limited to for passwords.
It's fundamentally the same attack as the plaintext case, although here it's breaking signature verification.
Here, the computation is
real_sig := Sign(msg)
Cmp(real_sig, provided_sig)
For a fixed message, Sign(msg) is constant time. So, with enough attempts you can create a high-resolution timer to tease apart subtle timing differences in Cmp.
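As a sketch of the kind of comparison that leaks (hypothetical code, not taken from the article): a byte-wise check that bails out at the first mismatch runs longer the more leading bytes agree, and that difference is exactly what the repeated attempts tease out.

-- Hypothetical early-exit comparison: running time grows with the length of
-- the matching prefix, which is the signal the attacker measures.
leakyCmp :: [Int] -> [Int] -> Bool
leakyCmp (x:xs) (y:ys)
  | x /= y    = False          -- bail out on the first mismatch
  | otherwise = leakyCmp xs ys
leakyCmp []     []     = True
leakyCmp _      _      = False -- length mismatch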
(But this relies upon transmitting the plaintext password, you might say! If you transmit the hash directly, then you're again hosed. One option I've seen previously is transmitting a hash of the password, then hashing + salting it once more, which relies on avalanche effect as you suggest. I provide no guarantees re: the security of such an approach, though -- talk to an actual cryptographer!)
> One option I've seen previously is transmitting a hash of the password, then hashing + salting it once more, which relies on avalanche effect as you suggest.
If the client-side hash is fixed, this is no more secure than transmitting the plaintext password, since as far as the server is concerned, it is the password. You could, however, send a nonce to the client, send back H(nonce++password), calculate H(random++Hsent), and compare that to H(random++H(nonce++password)) calculated on the server.
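Roughly, as a sketch (with `hash` left abstract as a placeholder; this only illustrates the blinding idea and is not vetted cryptography):

-- hSent     = H(nonce ++ password), computed by the client
-- hExpected = the same value, computed on the server side
-- The fresh random prefix blinds the comparison, so an attacker cannot line
-- up a chosen prefix against the stored value.
verify :: ([Int] -> [Int]) -> [Int] -> [Int] -> [Int] -> Bool
verify hash randomPrefix hSent hExpected =
  hash (randomPrefix ++ hSent) == hash (randomPrefix ++ hExpected)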
> talk to an actual cryptographer
I am not your lawyer^Wcryptographer and this is not legal^Wcryptographic advice.
Yes. You could (probably should) add an additional password = PBKDF(real_password) if that's a problem (e.g. because of password reuse), but it's not an essential feature from a cryptographic perspective.
I brought it up mostly because I was hoping someone would point out how typing can easily fix such timing attacks, but yeah that was a legitimate issue. [1] shows a walkthrough of one attack.
In the particular case of password verification, I don't think it'd be that bad. It simply gives attackers more information than they should have. The issues are obvious if the hash doesn't have perfect diffusion. It also allows you to begin reconstructing the hash and performing offline attacks e.g. to avoid (already too-lax) rate limits.
One thing that bothered me from an initial skim of that medium post:
> Even if the difference is only 0.0011 nanoseconds, the test is reproducible and each gives similar results.
From the graphs, this should read 0.0011 _milliseconds_. 0.0011 ns is eons faster than any CPU clocks out there (roughly 900 GHz), and would take a heck of a lot more samples :) Meanwhile, 0.0011 ms is on the order of a microsecond, which is very doable.
When comparing hashes, you can still have timing differences. You have to actually compare the whole string, drag the comparison result along and evaluate it at the very end. And make sure the compiler is not "helping". But this would give no knowledge of how long the matching prefix is. The hash should indeed mask that.
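Concretely, something like this (a sketch; and as said, you still have to check what the compiler actually does with it):

import Data.Bits (xor, (.|.))

-- Accumulate the XOR of every byte pair and only inspect the result at the
-- very end, so the running time doesn't depend on where a mismatch occurs.
constTimeEq :: [Int] -> [Int] -> Bool
constTimeEq xs ys =
  length xs == length ys &&
  foldl (\acc (x, y) -> acc .|. (x `xor` y)) 0 (zip xs ys) == 0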
I once encountered a situation where a junior programmer just wrote the password to the log file.
The point of using a separate type for secure data is to allow the compiler to try to trap situations where sensitive data is used incorrectly. (Yes, a dedicated moron could probably work around this kind of trap, but there is no such thing as a 100% perfect safety net.)
Trying to prevent data leaks at the hardware level is a completely different problem with a completely different solution.
That's a risk that should be independently assessed. This could mitigate accidental design decisions that would emit data that's not intended to be revealed. It would force designers to explicitly cast data as having been transformed to be safely emitted.
Wouldn't that require the programmer to be adversarial?
Since all type systems have escape hatches, I don't think they would be resilient at all in adversarial settings. This is more geared towards detecting accidental and _subtle_ data leakage.
Every program can already read its entire address space. What you're probably trying to refer to is the case where a subroutine is hosted within a virtual machine: given a side channel, the subroutine can read the address space of the VM host.
This is an extremely niche scenario; basically, it's just browsers. Most people don't accept arbitrary programs executing in their address space.
Memory-safe languages are the basis for language VMs' isolation of programs. Programs in those languages generally cannot read the entire address space, unless they have unsafe hooks.
Micro-architectural side-channels bypass language security mechanisms altogether.
You're just saying what I just said back to me, except with the confusing assertion about memory-safe languages being the basis for language VMs' isolation. I'm clearly missing something.
edit:
> Programs in those languages generally cannot read the entire address space, unless they have unsafe hooks.
Also, this isn't true? Unless we're splitting hairs about semantics (i.e., "read that exact address that isn't allocated yet"). Lots of memory-safe languages can allocate and compute arbitrarily, and there is no isolation of what subroutines can or can't read the memory of others.
Anyway IDK what points you're trying to make, this all seems very specific to VMs, and this post is tackling other problems.
Maybe it depends on your definition of VM? I think, given the way you phrased it, I read it as VM in the sense of host/guest (e.g. QEMU, VMware, etc.).
Spectre and Meltdown were broader. So in this case, if you loaded a script into your own process and executed it in its own sandbox (e.g. Lua, JavaScript, Python, etc.), that script could now access the containing process' memory space, whereas without Spectre and Meltdown it couldn't. Those attacks are still relevant, and browser vendors are trying to be very careful here not to reintroduce gadgets that could be used to resurrect the attack. It's also not just browsers. Notepad++ had to patch some things in its plugin system to protect against Spectre/Meltdown, if I recall correctly.
In fact, the original Spectre and Meltdown vulnerabilities impacted the kernel on the same machine, so process A could read process B's memory, although to my knowledge that's been largely mitigated. You do have the Rowhammer attacks, and those have not been mitigated, so all these attack vectors are still a concern.
> In fact, the original spectre and meltdown vulnerabilities impacted the kernel on the same machine so process A could read process B's memory
Not quite; process A could read kernel memory, but reading process B's memory requires a further security problem[0] such as the kernel having the entire physical memory space identity-mapped (for Meltdown, or for Spectre with gadgets) or the kernel running ('near' enough to speculate) Spectre-exploitable gadgets when context-switched to process B (for Spectre).
0: It arguably wasn't a vulnerability until Meltdown and Spectre entered the threat model, but it evidently should have been, since they in fact did.
I know we are off-topic, I was replying to your comment to clarify some misconceptions.
No, memory-safe languages do not let programs read the entire address space. Memory-safe languages include static (and dynamic) type safety measures so that a datum of type A cannot be treated as a datum of type B (if they are incompatible types). That means if different subroutines work on different types, they cannot read each others' data. That also implies that the heap has separable structure; your routine cannot get to my objects unless it has a type-safe access path to them. It cannot hunt through memory to find my crypto keys, e.g.
Therefore, security measures can be designed into programs that use the type system. The OP is an example of one.
> You're just saying what I just said back to me
Well, no.
> with the confusing assertion about memory safe languages being the basis for language VMs isolation. I'm clearly missing something.
Sorry if that confused you, it might have been cryptic. If you read what I've written in this comment it will be more clear now. Language VMs separate programs from each other because they can enforce a type system. Separate types in separate programs means they cannot interfere.
This kind of typing is quite cool, but I didn't find it so well explained.
The general pattern is to type programs not only with a type, but also with some "quantity", where that quantity puts constraints on what is inside. Using quantities like "zero", "one" and "lots", you get linear or affine typing; using quantities "public" & "private", you get the kind of data-leakage typing presented here, but you could just as easily do classification levels (secret, top secret, etc.). I'm bluffing, but I believe this can also model Rust-like lifetimes.

What you want is for your quantities to form a partially ordered semiring (or something slightly weaker): the sum of quantities (e.g. a + b) is used to compute the quantity of a conjunction, e.g. a tuple; the multiplication of quantities is used to compute composition, e.g. for functions: if a function takes an argument of quantity one to a result of quantity q, then for an argument of quantity a it will return a result of quantity a * q. And obviously the ordering relation is used for subtyping, e.g. "private" <= "public".
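For the data-leakage instance specifically, a tiny sketch (my own names): the quantity of anything derived is the join of the quantities of its inputs, so whatever touches private data stays private.

data Level = Public | Private deriving (Eq, Ord, Show)

joinL :: Level -> Level -> Level
joinL = max                        -- Public `joinL` Private == Private

-- Level of a pair is the join of its components' levels.
levelOfPair :: (Level, a) -> (Level, b) -> Level
levelOfPair (l1, _) (l2, _) = joinL l1 l2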
> This is safe because whoever has access to the more private data would also have access to the public one.
I think in real world scenarios, this would not be true of most stakeholder requirements.
That's what actually confused me here, because I don't see the use case in having levels of privacy.
I can see value in a boolean private/not private, for security, where private data you'd expect to always be access-logged, stored securely (encrypted), transferred over secure channels, etc. And you wouldn't ever want this data to accidentally be leaked and end up stored/exchanged in a non-secure place (like unencrypted app logs).
But for privacy, I feel that you'd want to tie the data to some permission control of some sort, and levels don't make sense for it. You'd want something like "user making request", and I don't know if you can embed that in types and check things statically, since it very much seems like a runtime thing.
I used natural numbers as a simplification. In full generality, you can use a lattice where min and max are replaced by meet and join operations. Then you can have categories that are not comparable to each other.
As for privacy, you definitely have security levels. For example, think of an employee record system in a simple company. The employee, payroll, human resources, and general public (perhaps accessing the system through the public-facing portion of their website) will all have different levels of access to information. You probably have the following lattice of access.
              employee
              /      \
             /        \
      payroll          human resources
             \        /
              \      /
               public
These security levels would make sense because public would have the least access (public bio perhaps), payroll would need access to things like your salary, human resources would need access to a record of disciplinary action, and you'd have access to all the information as you probably inputted most of it. This is one of many possible structures a company might have. I'd argue access to employee information is definitely a matter of privacy.
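In code, that diamond could look something like this (a sketch with my own names):

data Access = Public | Payroll | HR | Employee deriving (Eq, Show)

leq :: Access -> Access -> Bool
leq Public _        = True
leq _      Employee = True
leq a      b        = a == b         -- Payroll and HR are incomparable

join :: Access -> Access -> Access   -- least level that can see both
join a b
  | a `leq` b = b
  | b `leq` a = a
  | otherwise = Employee             -- Payroll `join` HR

meet :: Access -> Access -> Access   -- greatest level both can see
meet a b
  | a `leq` b = a
  | b `leq` a = b
  | otherwise = Public               -- Payroll `meet` HR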
That said, you're right that there is a dynamic side to this discussion. What happens if access depends on my geographic location? That information will be stored in a database. This information is dynamic, but you can still use static checking. It'd look something like the following:
location := getLocation(userID);
match location with
| UK -> ...
| US -> ...
Now, you don't know what location you have, but once you pattern match on it, you can check `...`s according to the correct level statically because you know that you can never execute that code if the location was elsewhere. So we use dynamic checks to reveal static information.
The big type-theoretic idea here is "existential types", which pack data and do not reveal information until it is unpacked.
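A rough sketch of that in Haskell (my own toy encoding, not the post's): pattern matching on the constructor tells the typechecker which case we are in, so case-specific code is checked under that assumption.

{-# LANGUAGE GADTs #-}

data UK
data US

data Location r where
  InUK :: Location UK
  InUS :: Location US

-- Existential wrapper: the region is unknown until we unpack it.
data SomeLocation where
  SomeLocation :: Location r -> SomeLocation

ukOnlyQuery :: Location UK -> String
ukOnlyQuery InUK = "checked under UK rules"

handle :: SomeLocation -> String
handle (SomeLocation loc) =
  case loc of
    InUK -> ukOnlyQuery loc   -- typechecks: here loc :: Location UK
    InUS -> "US-specific handling"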
Hope that was helpful. That was a great great question.
Thanks for taking the time to answer. If you can have something more akin to permission categories, that makes it more interesting to me.
By dynamic, in your example, I mean that, for example, as an employee I shouldn't have access to other employees' data, only my own. I don't see how that can be handled statically.
Also, in the code for your lattice example, I guess you'd have a different page for each one? So, say, one page for human-resources, and the variables in the code for that page would be of security level human-resources. In that way I get it: you can't accidentally add payroll-level information on, for example, the public profile page, because assigning from payroll to public would error. But how do you then control which pages the user has access to? Maybe for this you still need a dynamic permission system on top? And the compile-time checks only prevent the programmer from accidentally showing the wrong data on the wrong page?
Lastly, I still don't know if I buy into "levels". From my experience in enterprise systems, requirements always end up such that there's no real hierarchy, just various overlapping sets. For example, it's very likely that human resources would need to see data which the employee shouldn't see, like performance reviews, HR violations about the employee, compensation targets, etc. Similarly, I can imagine payroll needing to see some company-related financials related to the employee's pay which the employee shouldn't have access to either.
That's where I feel levels wouldn't meet requirements, since you rarely have scenarios where one category is a strict superset of another.
I can even imagine this causing leaks, as a programmer might get confused and think: "employee is the highest level, so I guess all those fields I added to the HR page should also go on employee and the assignment succeeds since employee is highest level of privacy, perfect I can leak no data". And now you've leaked private internal company data to the employee.
So here I think what would be cool is if I could instead say that some function returns a value of security [payroll, employee] and another function returns a value of [payroll], while yet another returns [employee]. And now a [payroll, employee] can be assigned to a [payroll] or an [employee] or a [payroll, employee], but you can't assign a [payroll] to a [payroll, employee].
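Roughly what I'm imagining (a sketch; the names are just illustrative): the label is the set of audiences allowed to see a value, and a flow is allowed only if it never widens that audience.

import qualified Data.Set as Set

type Label = Set.Set String

-- source label -> destination label -> allowed?
canFlow :: Label -> Label -> Bool
canFlow src dst = dst `Set.isSubsetOf` src

-- canFlow (Set.fromList ["payroll", "employee"]) (Set.fromList ["payroll"]) == True
-- canFlow (Set.fromList ["payroll"]) (Set.fromList ["payroll", "employee"]) == False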
> I don't see the use case in having levels of privacy.
The GDPR for example has different requirements for "special categories of personal data" (see Article 9 of the GDPR). If you handle that kind of personal data as well as other personal data, you may find it useful and/or necessary to distinguish between the two.
I kind of like these articles; it seems like there might be some helpful message hidden under the arcane incantations, but I always bounce because they require me to learn a whole new dialect.
The takeaway from this story is that we can enforce privacy just as we enforce more traditional types, at no runtime cost. It scales because it is compositional, and the compiler can guide us through programs that won't leak data.
A more niche outcome is that Haskell's traditional type system is rich enough to encode this privacy enforcing type system.
I think if you just ignore the syntax, you can still gain a good intuition about how data might leak in subtle ways and what is sufficient to prevent it.
The language is just assignment to simple variables (like `=` in C), sequencing (like `;` in C), if-then-else statements, and while loops.
The simple version uses algebraic data types (present in Rust, Haskell, OCaml to name a few) to describe the abstract syntax tree of the language.
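Something along these lines (a sketch, not the post's exact definitions):

type Var = String

data Expr
  = Lit Int
  | Ref Var
  | Add Expr Expr
  deriving Show

data Cmd
  = Assign Var Expr      -- x := e
  | Seq Cmd Cmd          -- c1 ; c2
  | ITE Expr Cmd Cmd     -- if e then c1 else c2
  | While Expr Cmd       -- while e do c
  deriving Show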
The inference rules [0] (the bits that have lots of dashes in them) are standard notation for describing logic and type systems.
The rest of the post uses more advanced types, particularly generalised algebraic data types [1] which allow you to have interesting types that reveal information to the typechecker when you pattern match on them.
If you really want to get into type systems and programming language theory, Types and Programming Languages [2] is a classic.
If you just want to understand the privacy ideas, I describe what the rules are supposed to do in detail, so you can gloss over the actual syntax. The only thing that part adds is that I can make Haskell's typechecker prevent construction of privacy violating programs.
If anything is still unclear, feel free to reach out. I'm happy to explain and elaborate.
The goal is the same, but the approach is different. This is entirely compile time and covers all executions of the program, whereas taint analysis is done at runtime and considers only some executions.
Also, taint analysis typically doesn't track the implicit flows described in the post; it only detects the explicit ones.
Let's say we have `a, b, d : 0` and `c : 999`. Then, according to your rule, `a := b :>> c := d` is well-typed and has security level `999`.
Now let's say I have a conditional expression `private : 999`.
The following program typechecks according to the rest of the restrictions. In particular, `999` from `private` is less than or equal to the `999` from the sequenced command.
ITE private $
a := b :>>
c := d
But we just leaked data because if the value of `a` (which we can observe) is not `b`, then we know that `private` was 0 (aka. false). This should never happen because `a` is public and `private` is private.
The intuition here is that the lower the security level a command has, the more public the conditional has to be. Otherwise, private data would *influence* public data, which we can then observe to deduce something about the private data.
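One way to spell out that intuition (a rough sketch only, with made-up helpers, and ignoring the explicit-flow check on assignments): a command's level is the most public variable it might write, and a guarded command is accepted only if the guard's level is at most that.

import qualified Data.Map as Map

type Var = String
type Level = Int                    -- smaller = more public

data Cmd
  = Assign Var Var                  -- x := y
  | Seq Cmd Cmd                     -- c1 :>> c2
  | ITE Var Cmd Cmd                 -- if guard then c1 else c2

writeLevel :: Map.Map Var Level -> Cmd -> Level
writeLevel env (Assign x _)  = Map.findWithDefault 0 x env
writeLevel env (Seq c1 c2)   = min (writeLevel env c1) (writeLevel env c2)
writeLevel env (ITE _ c1 c2) = min (writeLevel env c1) (writeLevel env c2)

wellGuarded :: Map.Map Var Level -> Cmd -> Bool
wellGuarded env (ITE g c1 c2) =
  Map.findWithDefault 0 g env <= min (writeLevel env c1) (writeLevel env c2)
  && wellGuarded env c1 && wellGuarded env c2
wellGuarded env (Seq c1 c2)   = wellGuarded env c1 && wellGuarded env c2
wellGuarded _   (Assign _ _)  = True
-- With a, b, d at level 0 and c, private at 999, the example above is
-- rejected: 999 <= min 0 999 is False.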
> it seems like you could leak info by just assigning two variables, where one is always public.
I didn't quite follow this bit. But if you elaborate, I'm happy to discuss.
I think maybe the key idea I was missing here is that the type of a variable is not always the same as the type of the scope it was defined in? If I'm understanding correctly, there's an implicit rule that's missing:
a : la    b : lb    a := b
--------------------------
          a : lb
Wasn't all the data being exfiltrated by the SolarWinds hack encoded into the copious amount of (seemingly-)random garbage emitted by .NET Framework? Pretty sure C# is typed.
The title is a bit clickbait, as the article isn't saying that statically typed languages don't leak data. Instead, it shows a hypothetical language with a special type checker, where types also encode privacy levels and the type checker asserts at compile time that no variable of a higher privacy type is assigned to a variable of a lower privacy type (amongst other things).
The article is really talking about research advancement in information flow control checks at compile time using advanced type systems, and no real language.
That is a different notion of typing. The type systems you are referring to (which is how they are commonly understood) classify data according to the operations they support. In this post, types classify privacy. These two notions are orthogonal and can be combined if needed. There are many many more notions of typing out there achieving wildly different goals.