Predicting Variable Types in Dynamically Typed Programming Languages (arxiv.org)
61 points by gauravanand25 6 months ago | 36 comments

> ...allows programmers to write code quickly by not requiring to declare the types of each variable which are determined at run-time based on the values assigned to that variable, thereby increasing programmer productivity

Modern compilers for statically typed languages are really good at inferring the types of identifiers from multiple hints, deterministically (see Kotlin, Swift, etc.). The statically typed languages mentioned in the paper (C, C++, and Java) are pretty old and therefore carry some baggage of verbosity that is no longer needed.

> ...since the variable types are not declared in the source code, the source code becomes difficult to understand and extend

> For programmers working on the large code base written in dynamic languages, it is hard to understand the control flow of the program if the types are not available at the compile time.

Dynamic languages, as a result of their "dynamicness", tend to allow much better expression of control flow than static languages. Only recently have statically typed languages targeted expressiveness as a first-class design goal. In fact, statically typed languages are notorious for obtuse control flow as a result of their type enforcement (see C, C++, Golang).

Yep I think working in a language with full inference and just annotating a few non-obvious types to help out whoever’s reading the code is a really nice pattern.

I learned functional programming at my last job with Ocaml, and did rely quite a bit on editor tooling to give the type of things which could have been annotated, so it’s definitely a tricky balance in teams.

But I use Scala now and I wish it inferred types half as well; having to give almost every type is really verbose and does feel like Java sometimes. Compiler speed is probably an obstacle to providing the same kind of tooling there, though, even if you could avoid writing most types.

I guess I’m basically agreeing with you - the combination of an ultra fast compiler and complete type inference is amazing to work with.

> Modern compilers for statically typed languages are really good at inferring types of various identifiers based on multiple hints in a deterministic way.

Hell, in Haskell inference can take you pretty far, and you can defer type errors til runtime in development and basically get an opt-in dynamically typed experience!

> you can defer type errors til runtime in development

Dumb question: is there any reason you'd actually want to do this?

I've never done it, but if you're used to a Python/JS/etc-like dev cycle and the idea of up-front type errors annoys you, Haskell has a perfect solution.

It's useful for prototyping, where your solution is always in flux and incomplete, but you need to test that some module is working correctly (like a hardware interface).

> Dynamic languages as a result of their "dynamicness" tend to allow much better expression of control flow when compared to static languages.

That may be, but I'll be damned if Golang's verbose type casting hasn't saved me from some bugs over the years.

Also, I think there's an argument for type declarations being expressive in their own right. I like seeing what types a function receives and returns. It's a quick jumping off point when trying to understand what the function does.

> Modern compilers for statically typed languages are really good at inferring types of various identifiers based on multiple hints in a deterministic way (see Kotlin, Swift etc)

I think it’s important to mention that in Swift this inference has exponential behavior, and can time out if the expression is too complex.

I’ve always wondered why Swift and Kotlin, which are two very similar languages of approximately the same age, have such widely different compiler stability. Kotlin feels quite solid and mature, while Swift sometimes just feels like one big hack. Considering the enormous resources Apple has compared to JetBrains, I find it strange their trajectories have been so very different.

JetBrains' business is all about understanding programming languages, Java in particular. And Kotlin is very, very close to Java.

With Kotlin, JetBrains was able to see what went wrong when previous JVM languages Scala and Apache Groovy each tried to be a better Java, and not repeat their mistakes, e.g. excessive operator overloading in Scala, or lack of scalability in Groovy.

Swift is one of the most complicated languages I have ever used, similar in complexity to C++. It’s also adding features relatively quickly. I think some of the earlier releases didn’t focus as much as they should have on stability, and that this technical debt is being repaid now.

> Mentioned statically typed languages C,C++ and Java are pretty old and therefore carry some baggage of verbosity that is no longer needed.

hahaha. Just getting people to adopt `auto` in C++ is an uphill battle. People want to write types.

Not only in C++, everywhere.

The FUD against var in C# and Java is incredible; you see endless threads about how they are now dynamically typed like JavaScript.

Sometimes I wonder how those people ended up learning to program. Don't people read books any more?!

yep, only did a few C# projects but had the same experience.

Sometimes writing out the types makes the code clearer (while auto is good for long type names like std::vector<T>::iterator). And although this is an edge case, if you use expression-template libraries like Eigen, you cannot use auto (otherwise the template deductions don’t work properly).

I'm curious when this would actually happen. Some kind of recursive implicit specialization? AFAIK auto x = foo; is just sugar for decltype(foo) x = foo; rather than ML-ish inference.

The problem with the technique used by Eigen (and also Boost.Spirit among a few other famous libs), called expression templates, is that it leverages the fact that the references to temporaries in function calls will be maintained until the end of the expression. The last operation in the expression (assignment to a specific type) will do processing to preserve those temporaries, compute the result, etc...

e.g. it more or less looks like this in a trivial case:

    #include <cstdio>

    template<typename T1, typename T2>
    struct add {
      add(const T1& t1, const T2& t2): lhs{t1}, rhs{t2} { }
      const T1& lhs;
      const T2& rhs;
      constexpr auto get() const { return lhs.get() + rhs.get(); }
    };

    struct integer {
      const int& i;
      constexpr const int& get() const { return i; }
    };

    int f();

    int main() {
      // x holds references to temporaries that die at the end of this statement
      auto x = add{add{integer{f()}, integer{f()}}, add{integer{f()}, integer{f()}}};
    }
Here, storing the value in `auto` is wrong because by the time the expression has ended, the references point to stuff that isn't in scope anymore

Hence the solution is to introduce a type that will actually do the computation or store the result during the expression; but before C++11 you had to introduce a type here anyway, which is what was done, and of course it worked.

The general rule of thumb for us is: auto is allowed if the type is written somewhere on the line, or in a for-each over a container, where it has to be const auto& unless you have good justification for the copies. The inadvertent copies in for-each constructs are the biggest pitfall for us, because let's just say you don't pick C++ to write software that doesn't have to be as fast as possible :).

I've basically spent my career without static types (except for the little bits of C that happen here and there), but I keep hearing about how types would change my world... But then every time I've looked at a typed language lately, everything is type 'auto'. Well, maybe not Java, I don't think they have auto yet?

As an honest question, if types are good, why would you put auto, instead of telling me the type?

For context, I'm a neanderthal and just use a boring text editor, so I'm not getting any tool assisted type information if you write auto everywhere.

Because even with type inference, the compiler shouts at me at compile time if the types don't match, instead of crashing at runtime because Joe on the 5th floor forgot to add a field when he checked in his code.

So I don't need to write additional unit tests to do the compiler's work that would have prevented Joe from checking in his code.

As others have said, you keep all the static checking with these systems but you get the brevity of a shorter definition.

It also makes refactoring easier. If you use var then you don't need to change all your declarations if you change a class name.

Because it's quicker to write and often more readable. Your IDE (or tooling) can provide you with the type of a variable whenever you want.

Automatic type deduction is not the same as dynamic typing. There are still types and you still get compile time errors if they don't match.

Now both Java 10 and C++11 have local type inference: the var and auto keywords respectively.

That hasn't been my experience with Rust at all. Autocomplete never works.

How does this relate and compare to actual type inference? For example, Common Lisp implementation SBCL is able to infer types pretty nicely now and it doesn't need neural networks for "predicting" the types with some kind of chance.

I’m not sure how much of it is still relevant to SBCL since the fork was quite a while back, but the CMUCL user manual is a great read that describes how type inference and many other things work. https://www.cs.utexas.edu/users/jared/Milawa/Support/cmucl64...

Probably still relevant given how SBCL's user guide says (in the only section that includes mention of type inferencing):

"FIXME: The material in the CMUCL manual about getting good performance from the compiler should be reviewed, reformatted in Texinfo, lightly edited for SBCL, and substituted into this manual. In the meantime, the original CMUCL manual is still 95+% correct for the SBCL version of the Python compiler."

Data flow analysis will certainly predict a fair number of types in typical programs in dynamically-typed languages, but there will also be lots of cases it can't handle. I don't know the details of the CMUCL/SBCL type inference algorithm, but I've worked on whole-program data flow analysis, and there are still lots of types it fails to recover, for various reasons. Often, for example, the entire program is not available for analysis; if one doesn't know all the places a particular function can be called, one can't soundly infer all the types it might be passed. Nonetheless, in practice a given parameter of such a function is normally passed arguments of a particular type (or subtypes thereof), and frequently, clues such as the parameter name can help one take a pretty good guess as to what that type is. Since such a guess is sometimes going to be wrong, it can't safely be used for optimization, but it can be very useful for other purposes, such as analyzing the program to find potential security vulnerabilities.

DFA isn’t usually used in type inferencing algorithms, at least ones that generate types for programmers rather than compiler optimizations. It isn’t that it is expensive, but in general syntax tree walking works well enough.

Whole-program analysis is too expensive for type inference (or much of anything, for that matter) even with its precision ramped all the way down via something like 0-CFA. Type inference with subtyping is a very similar problem to alias analysis, which hints at why doing it in anything but very restrictive contexts is too expensive.

If you don't need perfect soundness, whole-program 1-CFA can be made practical by heuristic pruning. I've done it successfully in a commercial static analysis product.

It can still be used for optimization in many cases depending on the language, e.g by generating specialised versions and shunting calls from known call sites to the specialised version when you can either guarantee the guesses are right or cheaply lift guards up the call stack.

Kind of disappointed that they don't include this kind of analysis as a baseline, especially since the types inferred by traditional algorithms are usually sound, which is not something you can say for ML-based methods (so far).

Snigl [0] traces the code before running it; simulating stack contents and inferring generic calls based on that information as far as possible, among other things.

It's not perfect, but it's the only way I've found that makes sense in combination with Forth-like stack semantics where function parameters are never specified explicitly. And that runs fast enough to make sense in an interpreted language. I also like that it's implemented as a transformation on VM code, which makes it flexible and easy to debug.

[0] https://gitlab.com/sifoo/snigl#tracing

kind of cool but why is the model trained on such a small data set? why not train it on a bunch of codebases that use type hints?
