A Primer on Type Systems

DonaldPShimoda · on Jan 30, 2019

I really liked this brief overview of type systems overall, but I especially liked that the author took the effort to move away from the terms "strongly typed" and "weakly typed". Too often these terms cause confusion. Even worse is when people confuse them for "statically typed" and "dynamically typed" (respectively), leading to people making claims like "Python is weakly typed" and "C is strongly typed" (which, depending on your definitions, are not true statements).

This material wasn't covered in my core curriculum for my CS degrees, which I think is a shame. Some students might find this stuff interesting, and I think introducing it to undergrads at least a little bit and saying "By the way, this is a thing you can research" could help some students with breaking into the academic side of CS.

msla · on Jan 30, 2019

It's more interesting to answer two questions:

1. Can I easily subvert this type system?

2. Can I make this type system map to concepts in the problem domain?

In C, the answer to 1 is "yes" and the answer to 2 is "to a limited extent". In fact, in C, you can subvert the type system so easily that it's considered impossible not to, to the extent that using the standard library requires throwing type information away, by using pointers-to-void or generic "char" buffers, and as for the second... well, Apps Hungarian notation exists for a reason. (Systems Hungarian exists for a reason, too, but I'll try to keep the insults out of this post.)

https://en.wikipedia.org/wiki/Hungarian_notation

The first question is interesting because it tells you if the type system can be used to enforce invariants, and the second question is interesting because it tells you if those invariants will be worth enforcing. If you're writing in Standard Pascal, and your type system enforces array length as a part of array type, well, to quote a metal album: "So Far, So Good... So What?" What does that tell me about whether my program is semantically correct as regards the problem domain? Similarly with enforcing size specification types: Differentiating between a short, an int, and a long tells me Sweet Fanny Adams about what values of those sizes mean in my program. Typing isn't "strong" if it's enforcing rules which ultimately make no sense, and if it's "sound", well, mathematical proofs of something I'm not interested in are themselves uninteresting.

dllthomas · on Jan 30, 2019

I'd extend 1 to include "can I easily avoid subverting this type system?" Guarantees are better, but as long as I can get help that's enough for a tool to be useful assuming 2.

Regarding 2, I've long complained that the conception of types embedded in many systems (and many heads) is very representational and most of the time we don't actually care about representation. The exceptions I typically mention are where we care a lot about performance, and at system boundaries. We can fit those into your analysis by asserting (reasonably, I think) that in those instances representation starts to be a part of your problem domain.

AnimalMuppet · on Jan 30, 2019

I'd also extend 1 to include "can I subvert this type system when I really need to?"

(And to those who would say "You never need to", I don't think you can be quite that dogmatic. The real world includes a lot of unforeseen scenarios...)

munk-a · on Jan 30, 2019

Aha! I will counter that you never actually want or need to break the type comprehension of your executable but the static analysis of the compiler or dynamic analysis at runtime may be inaccurate with regards to the truth and motivate you to try and subvert it. This may be about what you were thinking when you wrote your comment but I wanted to highlight a small but important difference. I may find a useful bitwise function that takes arguments of type int and really want to execute the same bitwise function on a pair of doubles, in this case the actual type of data I want the function to operate on is "a series of bytes" but the compiler/whatever's typing system may force me to declare the function as taking an IEEE 754 or two's complement integer instead.

In these cases the typing system is constraining you from declaring a type safe operation the right way by forcing you to play within a subset of the universe of valid types.

AnimalMuppet · on Jan 30, 2019

Imagine that you have a binary-only third-party library, one function of which returns an opaque type. Imagine that you want to unpack that and observe the state of some data inside it (perhaps guided by online hints), and the vendor won't give you the ability to do so because 1) they're busy, or 2) they want a lot of money, or 3) they're out of business. You don't want to have to subvert the type system, but you still kind of need to.

And if you're going to tell me that the only reason you need to in this scenario is because the third-party vendor didn't do their job (including creating the types) right, I'd agree. That doesn't actually change what you need to do, though.

munk-a · on Jan 30, 2019

I agree, but in that case you're fighting against the typing system more than typing itself, whatever you are doing with that thing and however you would abstractly define that type (maybe as "We expect this thing to have the type of the thing that terrible third party library usually gives us back") you will make assumptions based on the fact that that value is actually one of those and you never want those assumptions violated. In this case the typing system is too restrictive and it might be a case where duck-typing would come in handy (since you can usually post-apply types in that sort of a setting).

dllthomas · on Jan 30, 2019

It's an important question. Systems where you can't ever subvert the type system may be useful in some places (sandboxing) where systems with an escape hatch aren't, and maybe vice-versa (plausibly "in theory", certainly "in practice" IME). The two might be variants of the same underlying language, though - Safe Haskell is at least an attempt in this space; I don't know how well it delivers.

ggm · on Jan 31, 2019

I've always felt type coercion is a pragmatic recognition of a lack of time, but also a lack of analysis and abstraction. If you are type coercing in an existing body of code, you probably walked into an ill-understood quirk in the key information model, or a change in circumstance.

If you are routinely coercing types in new code, you haven't done enough design and analysis. Something is off-key.

bogomipz · on Jan 30, 2019

>"well, Apps Hungarian notation exists for a reason. (Systems Hungarian exists for a reason, too, but I'll try to keep the insults out of this post.)"

Can you elaborate? I read the wiki page on Apps Hungarian. I am not following. How does/would this address issues with the type system in C?

>"Can I make this type system map to concepts in the problem domain?"

I didn't really follow your Pascal example explanation of this, you first mentioned "because it tells you if the type system can be used to enforce invariants." But the went on to talk about array lengths. How does whether or not the invariants are enforced by the type system help you answer the question of whether or not it maps to the problem domain? Might you be able to provide another example? Thanks.

lmkg · on Jan 30, 2019

(Not the author of the post you're replying to)

Apps Hungarian is, to a first approximation, a way for your program to use a type system not supported by the language. It's an ugly hack, in no small part because the type-checking is done visually by the programmer, rather than automatically by the compiler. People only use it if there's substantial value in the type system they're bringing in... which generally only happens if the language's type system is inexpressive.

Regarding Pascal, it's important to note that the array length is not an optional part of the type. An array of length 10 is type-incompatible with an array of length 20. A function whose input is an array must specify the length of the array it accepts, and the function cannot be called with an argument of different length. I assume the author's overall point is that this type system, while strong, is not useful for meaningful work.

msla · on Jan 30, 2019

> Can you elaborate? I read the wiki page on Apps Hungarian. I am not following. How does/would this address issues with the type system in C?

Apps Hungarian allows you to encode information about the value a variable contains which you can't capture in the type system.

For example, imagine you're writing a program which handles user input. You need to distinguish between Sanitized Strings and Unsanitized Strings because if you don't, you open yourself up to security problems. Absent a way to extend the type system to put this information in a type, you do this:

    char *usStr; /* An Unsanitized String */

    char *snStr; /* A SaNitized string */

You add a couple letters to the beginnings of the variable names to encode what the type system doesn't.

You can see that if you have a line of code which says:

    snStr = usStr;

it is wrong, and your brain can print a full-color warning message with highlighting.

Bottom Line: Apps Hungarian makes it easier for you to be the type checker.

> I didn't really follow your Pascal example explanation of this, you first mentioned "because it tells you if the type system can be used to enforce invariants." But the went on to talk about array lengths. How does whether or not the invariants are enforced by the type system help you answer the question of whether or not it maps to the problem domain? Might you be able to provide another example? Thanks.

Array lengths are an invariant. They're just one I don't care about, because nothing in the problem domain is modeled by how long a given array is. The contents of those arrays are much more important, and the types should vary based on that, instead.

A somewhat simplistic example:

You have a program which prints travel itineraries for people who must be addressed correctly along the way, where correctly means using their prenomial and postnomial titles. For example "Dr. Mary Richards, Ph.D." would be insulted to be merely "Mary Richards" or "Dr. Mary Richards, MD". You also have to print route information, which is a list of towns and states, like "Baltimore, MD".

Therefore, you have to ensure three things: Prenomials are always prepended to names, postnomials are always appended to names, and town names are always suffixed with state abbreviations. Oh, and if you print a name without prenomials and postnomials, you'll not escape unscathed.

Wouldn't it be nice if the type system were capable of keeping track of which strings were prenomials, postnomials, personal names, town names, and state abbreviations, to ensure you never try to append a state abbreviation onto a personal name, and never try to print a personal name alone? Those are the kinds of interesting invariants I'm talking about.

bogomipz · on Jan 30, 2019

Thanks for the thorough explanations. This is really helpful. Cheers.

munk-a · on Jan 30, 2019

Interestingly "statically typed" and "dynamically typed" are becoming less exclusive, there are code sniffers that can run static analysis of code to detect potential type incompatibilities during inspection and PHP even now has a group of parse errors that will be raised if incompatible classes definitions are detected i.e.

    class Foo { function do(): int { return 1; } }
    class Bar extends Foo { function do(): string { return 'a'; } }

while PHP will also do dynamic type checking during execution. I don't think this is surprising since static type enforcement has never been a requirement of a compiler/interpreter of a executable code page without run-time type checking - it's just a sort of bonus utility that is packaged into the process to make things easier. Whether your static code analyzer is in gcc or is just some separate utility it all works out in the end, so any sort of code linters are essentially doing a static type analysis (among other things, and usually language constraints prevent this analysis from being as much of a solid proof as languages designed to specifically be statically typed).

So I pretty strongly disagree that typing systems can either be static or dynamic, because I think the separation of manifest vs. implicit is a false one, lots of languages (again) support partially explicit type declarations while allowing some sort of support for a generic declaration.

sifoobar · on Jan 30, 2019

Snigl [0] has statically typed function signatures, struct fields, bindings etc; but values still carry their type.

It also traces code before running it to eliminate as many type checks as possible.

The categories aren't very helpful from my perspective, same goes for compiler vs. interpreter. All it leads to is endless arguing about definitions and discouraging of novel approaches.

[0] https://gitlab.com/sifoo/snigl#types

munk-a · on Jan 30, 2019

I agree, and I don't think this is a bad trend, once upon a time I think static vs. dynamic was a clear distinction, that distinction is being worn down as we learn more about the value and costs of those approaches - and as that happens we're compromising the approaches to gain more of the value.

In PHP the (mostly) JIT interpretation exposed a weakness where infrequently executed pieces of code were harder to have confidence in. Unit tests and such have raised our confidence but explicit typing can also raise our confidence in an easier manner, both are good to have but having access to more tools just makes our lives easier.

In fact PHP can be pre-cached in bytecode now (which is basically just a traditional compiled approach) and in the default run mode PHP code files are interpreted and then cached to minimize the number of times code needs to be interpreted. The lines between compiled vs. interpreted are getting really fuzzy now and were pretty fuzzy as far back as Java bytecode (which is as far as my memory goes, others may have a better handle on earlier experiments in partial compilation)

naasking · on Jan 31, 2019

> I agree, and I don't think this is a bad trend, once upon a time I think static vs. dynamic was a clear distinction

I think there's still a very clear distinction, but there's often value in having access to both in one language. The dynamic typing features can often make up for limitations of your type system, for instance.

_0w8t · on Jan 31, 2019

C++ templates are not typed structually. The compiler does not assign types to them at all. Rather they are treated as macros and the type assignment and checking happens during the expansion. For example, one can write a template that has no valid expansion, but as long as nobody uses it, the compiler accepts it.

In sense templates are similar to dynamic languages. Except in C++ it is the compiler, not runtime, that evaluates the templates.

trurl · on Jan 30, 2019

> There does not seem to be any equivalent of the notion of soundness in the world of dynamic typing. However, we can still talk about whether a dynamic type system strictly enforces type safety.

My understanding is that there are well understood definitions of soundness for dynamically typed languages. The soundness theorem will generally be weaker ("your program will evaluate to a value or abort on a type confusion") but it is certainly possible to be sure a given dynamically typed language definition will never make mistakes like confusing an integer for a pointer. Though in the scheme of things such theorems are not much weaker than what you get for a statically typed language with exceptions ("your program will evaluate to a value with the given type or throw some exception").

naasking · on Jan 31, 2019

It actually is significantly "weaker". You make the wrong analogy when you say the dynamically typed case maps to a statically typed one with an exception (which presumably is a value in your language).

The correct analogy is to a segfault, meaning the program exited abnormally because it's inconsistent in an important way. It's the reason statically typed languages are faster than dynamically languages: expressions have a fixed meaning than enables optimisation guarantees you can depend on.

kd0amg · on Jan 30, 2019

That is a "soundness theorem" and maybe even a "[static] type soundness" theorem (if you like the word "unityped"), but the point is that's not really saying anything about conclusions drawn by the classification of values into dynamic types.

gryfft · on Jan 30, 2019

As someone who is extremely unfamiliar with category theory, types, and C++, something is bothering me that I would love to have clarified.

Polymorphic function templates are introduced with addEm:

  template <typename T, typename U>
  T addEm(T a, U b)
  {
      return a + b;
  }

It seems to me that this is incorrect. From the Wikipedia page on C++ templates [1], a polymorphic function template is described like this:

  template <typename T>
  inline T max(T a, T b) {
      return a > b ? a : b;
  }

The first example seems to define a template so that the function takes two arguments of different types, then relies on the compiler to reject inputs for `+' of types not defined for it. (This does not seem to track with the explanation following it, which indicates that addEm should take two arguments of the same type.)

The implication in the first example seems to be that if you want your template to be compatible with more than one type, you must explicitly indicate so in your template parameters. However, the Wikipedia entry seems to indicate that C++ implicitly instantiates object code for as many types as the template is called on at compile time, a function instantiated for each of the types (addEm<string>, addEm<int>, addEm<double>, etc.), and a compilation error resulting if inputs of different types are passed to the template.

Am I missing something? For reference, here is how I would intuit the first example should look, based on my current understanding.

  template <typename T>
  T addEm(T a, T b)
  {
      return a + b;
  }

I'll add that given my lack of knowledge of C++, it could well be that the template syntax allows types T and U to be the same in a particular instantiation of the template, but my understanding is that at least they can be different, and why would you intentionally do that in this example?

1. https://en.wikipedia.org/wiki/Template_(C%2B%2B)

AnimalMuppet · on Jan 30, 2019

I believe that your understanding is correct.

   template <typename T>

is a template with one "type parameter" (may not be the correct technical term), whereas

   template <typename T, typename U>

is a template with two type parameters, and they can be different. (That's why you give two parameters, so that they don't have to be the same.)

gryfft · on Jan 30, 2019

I appreciate the sanity check with a depth and viscerality I can't properly express.

I also apologize to you and anyone reading for any inaccurate terminology on my part.

styfle · on Jan 31, 2019

I have to post this parody lightning talk because Gary Bernhardt is hilarious and the slides demonstrate some of the (exaggerated) confusion around static vs dynamic types.

https://www.destroyallsoftware.com/talks/useing-youre-types-...

g33k247 · on Jan 30, 2019

I found this article helpful. Since C is a "strongly typed" language it helps to review what is meant by type. In addition, like #DonaldPShimoda, I liked that the author challenged to the existing notions of type. Good read.

ternaryoperator · on Jan 30, 2019

By all definitions, C is not strongly typed.

The author just points to selected links, without going to the links themselves. If you go to the link he points you to [1] where C typing is defined, you find that two of the three links there that claim C to be strongly typed are broken and the third link immediately qualifies the assertion with: "To be precise, relatively strongly-typed app-specific dialects of the not-very-strongly-typed language, C."

[1] http://wiki.c2.com/?StronglyTyped

whage · on Jan 30, 2019

The way the author describes static and dynamic typing seems to imply that static == compiled and dynamic == interpreted. I probably misunderstand something, can someone help me clear things up?

zozbot123 · on Jan 30, 2019

Dynamic typing boils down to runtime tagging of any value that could be referred to by a "dynamically-typed" variable. That's not "interpreted" in a strict sense, but it requires dispatching on every single value access, which will definitely affect performance.

pavon · on Jan 31, 2019

That is typically the case, but it doesn't have to be. You could have an interpreted language that determines all the types when the code is first parsed and generates errors before it starts executing. You can also have a statically compiled language that defers (some or all) type checking until runtime.

amyjess · on Jan 30, 2019

This is an excellent article.

The only thing I'd want to see added to it is a section on gradual typing, a la Perl 6.