I have often felt that there are many discontinuities between "pretty" Haskell as it is taught and pragmatic Haskell. I haven't used Julia enough in anger to say this for sure, but from what I've seen of how it works, there is great potential for pragmatic Julia code to be as beautiful as the high-level, abstract code.
For a long time I have felt that Haskell represented the most mathematical language. Julia really shows that there are other ways of building a mathematical language with taste and style. It's oriented to practitioners and applied math folks rather than computer scientists and pure mathematicians. I have enjoyed seeing the differences between these systems quite a bit, and I think Julia has a bright future as a practical, daily-use system for science.
I recently wrote a simulator intended as a demonstration of some issues in a paper. I found that using non-standard characters let me produce a clearer implementation in code of the calculations in the paper - so I think it's a great thing that you can do this in Julia, and that it should be encouraged.
In 2017 programmers have access to super-powerful computers - surely spending a few cycles to render and manipulate these characters is appropriate? What do people think?
That's true. I'm currently using an emacs extension that converts LaTeX-like strings (\alpha, \Gamma, ...) to Greek Unicode characters, but I understand that's more of a hack than an actual solution.
A couple of my previous HN comments on Julia's indexing: https://news.ycombinator.com/item?id=15472933 and https://news.ycombinator.com/item?id=15473169
Plus, I found it exciting to read community discussions for a language growing towards 1.0, to understand the different approaches that were considered, and why certain choices were made. From what I've seen, the whole development process was quite transparent, and the developers have always indulged sincere questions/suggestions from participants, dealing in concrete examples instead of sweeping generalizations and polemic. I don't know what standard to compare this to, but I've enjoyed the experience.
Disclaimer: I recently started contributing to Julia (but I wouldn't have bothered in the first place if I didn't think it was so cool ;) )
But, I think a lot of people in scientific computing are tired of the typeless mess that Python/Numpy/Scipy code-bases evolve into. And for those people, I think it has a lot of merit.
At the end of the day, the language was designed to fill one major gap. A lot of time and effort in R&D is spent either architecting sane C++ memory models or reverse-engineering existing Python code. Alternative safe, well-performing languages like Java are simply not fast enough here: to exploit the features of a modern CPU, you need to be native. And as a side note, MATLAB usually cannot be run in a production environment.
Strong type safety is mostly just a waste of time.
I can describe a recent example. I had to ensure that an integer is always an int64 as the logic passes through different python modules and libraries. It was an absolute hell to track down all the places where things were dropping down to int32. With static types this would have been a no-brainer. This is not to say that I do not enjoy its dynamic typing where it is appropriate.
Hopefully Python 3 will make things better with optional types. But it's still not statically typed; it's just a pass through a powerful linter.
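A small sketch of both halves of this: numpy's fixed-width integers wrap around silently by default, and the best a linter-style check can anchor to is an annotation plus a runtime guard (the `accumulate` helper below is hypothetical, just for illustration):

```python
import numpy as np

# Silent wraparound: an int32 array overflows without raising by default.
a = np.array([2**31 - 1], dtype=np.int32)
wrapped = a + 1  # wraps to -2**31, no exception
print(wrapped[0], wrapped.dtype)

# Hypothetical helper: pin the width with an annotation plus a runtime
# assert; a checker like mypy (with numpy's stubs) can flag the annotation,
# but only the assert catches a wrong dtype arriving at runtime.
def accumulate(counts: np.ndarray) -> np.int64:
    assert counts.dtype == np.int64, f"expected int64, got {counts.dtype}"
    return np.int64(counts.sum())

print(accumulate(np.array([1, 2, 3], dtype=np.int64)))
```

The point is that neither the annotation nor the linter stops the silent int32 wraparound on its own; only discipline at each boundary does.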
That said, I find your particular example with int64 extremely hard to believe. I assume you’re using numpy or ctypes to get a fixed precision integer, in which case it should be extremely easy to guarantee no precision changes, and e.g. almost all operations between np.int64 and np.int32 or a Python infinite precision int will preserve the most restrictive type (highest fixed precision) in the operation.
I work in numerical linear algebra and data analytics and have used Python and Cython for years, often caring about precision issues— and have literally never encountered a situation where it was hard to verify what happens with precision.
Unless you’re using some non-numpy custom int64 type that has bizarre lossy semantics, it is quite hard to trigger loss of precision. And even then, a solution using numpy precision-maintaining conventions will be better and easier than some heavy type enforcement.
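The precision-preserving convention mentioned above is easy to check directly; a minimal sketch:

```python
import numpy as np

# Mixing fixed-width integer arrays: the result keeps the highest precision.
a64 = np.array([1, 2], dtype=np.int64)
a32 = np.array([3, 4], dtype=np.int32)
print((a64 + a32).dtype)  # promotes to int64

# A plain Python int scalar does not demote a fixed-width array either.
print((a32 + 10).dtype)   # stays int32
```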
When rubber is about to hit the road, i.e. near deployment with money at stake, I would have loved an option to freeze the types, at least in many places. Cython comes in handy, but it's clunky, and its syntax and semantics are not super obvious to a beginner (I am no longer one, but I remember my days of confusion regarding cimporting std headers vs. Python headers, how to use Python arrays (not numpy arrays), etc.).
I am curious, have you put money at stake supported only by dynamic types?
Regarding int32 vs int64, it's not a precision issue; it's about sparse matrices with more than 1<<31 nonzeros. I am equally surprised that you have not run into this given your practical experience with matrices.
My case involves more than just numpy. There's hdf5, scipy.sparse, some memory mapped arrays and of course numpy.
Given the amount of time I spent to debug this, I would have killed for static type checks.
But I do see that counting nnz boils down to a call to np.count_nonzero, which treats bools as np.intp, which is either going to be int32 or int64 (very weird that it chooses signed types), then calls np.sum.
The best solution would be to use np.seterr to warn exactly at call sites with int32 overflow, but amazingly, there seems to be an open numpy issue saying that seterr is not guaranteed for sum.
I do think seterr + logging would be better for this than roping in static typing everywhere just to get a one-off benefit like this.
Quick question: when you create a scipy csr matrix, how do you ensure the subsequent multiplication operator falls back to C code that uses int64, not int32, to index the internals? I thought that if the indices array was an int64 array, that would do the job. I was wrong. Anyway, even if that had worked, it would still have fallen short of a guarantee: if it worked, it just happened to work -- that's an anecdote.
If one had static typechecks, one would not have to read through all the layers to be sure. A compile error, if any, would have told me.
We also can't use scipy.sparse directly because we don't have that much RAM on these machines. We do use scipy.sparse, but it operates internally on memory-mapped arrays. Depending on the platform, memory-mapped arrays can be limited to an index of 1<<31, so we have to be extra careful about which type is used for indexing in the native libraries that these layers wrap.
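A hedged sketch of why dtype discipline matters with memory-mapped arrays: the file on disk is just bytes with no type information, so reopening with the wrong dtype silently reinterprets the data (the file name here is made up for illustration):

```python
import os
import tempfile
import numpy as np

# Write four int64 index values to a memory-mapped file.
path = os.path.join(tempfile.mkdtemp(), "indices.dat")
idx = np.memmap(path, dtype=np.int64, mode="w+", shape=(4,))
idx[:] = [0, 2, 5, 9]
idx.flush()

# Reopening with the wrong dtype silently reinterprets the same 32 bytes:
# nothing raises; you simply get twice as many "elements".
wrong = np.memmap(path, dtype=np.int32, mode="r")
print(len(idx), len(wrong))  # 4 vs 8
```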
BTW, it's far from a one-off benefit. This was just one of the examples fresh in my memory. It directly affects real money. There you don't want to ship code that could have bugs that can cost you. Static types help rule out these cases once and for all. With runtime checks it is very hard to be sure that you have caught all of the code paths that can have these mismatches.
I agree that in grad school it's different :) One can play fast and loose. Even more so if research is not expected to be reproducible -- that would be pure science.
I don’t know what you mean by “that’s just numpy” though — since even if this flows through other systems, tracking it at the source in numpy would be obvious.
“Static types help rule out these cases..” — I just disagree. That is what’s advertised, but it’s just not true. Years of working in Scala for very heavy enterprise production systems has made me realize it’s a very false promise. There are actually remarkably few classes of these errors that are removed by static type enforcement, and perfectly good patterns to deal with it in dynamic type situations.
If static typing was free, then sure, why not. But instead it’s hugely costly and kills a lot of productivity, rather than the promise that it improves productivity over time by accumulating compounding type safety benefits.
I think a good rule of thumb is that anything that causes you to need to write more code will be worse in the long run. There’s no guarantee you’ll actually face fewer future bugs with static typing and visibility noise in the code, but you can guarantee it adds more to your maintenance costs, compile times, and complexity of refactoring.
I guess Python's gradual typing is a good compromise, since you don't have to choose between zero type safety and speculative all-in type safety where the maintenance overhead almost always outweighs the benefits (rendering it a huge and irreconcilable form of premature optimization).
You can only add it in those few, rare places where there is demonstrated evidence that the static typing optimization actually has a payoff.
You can't possibly be saying that! Even if one assumes that the source is numpy.
Regarding the rest, let's say my experience with OCaml has been more gratifying than yours with Scala.
> We just accumulate the count with a python integer.
That won't help when you are using scipy.sparse for sparse-on-sparse multiplication, because the multiplications fall back to C code. You have to ensure that it falls back to C code that uses int64 for indexing the arrays. I am sure you are not saying that you do sparse multiplications of this size in pure Python.
Our differences in tastes aside, you seem to work on interesting stuff. Would love exchanging notes in case we run into each other one day. Should be fun.
For csc and csr matrices at least, these operations typically iterate over the underlying indices, indptr and data arrays. csc `nonzero` uses len(indices), which ultimately relies on the C-level call to malloc that defined `indices` (and so uses the system's address-space precision; it would never report the number of elements in a lower-precision int than what the platform supports for memory addressing) and returns it as an infinite-precision Python int. Afterwards it only uses arrays of indices, not integers holding sizes.
Long story short is that at least for csc matrices, the issue you describe wouldn’t be possible internally to scipy’s C operations, as you’d always be dealing with an integer type large enough for any possible contiguous array length that can be requested on that platform (and the nonzero items are stored in contiguous arrays under the hood).
On my team we are not doing pure Python ops on the sparse matrices, rather we needed customized weighted operations (for a sparse search engine representation that weights bigrams, trigrams, trending elements, etc., in customized ways) and some set operations to filter rows out of sparse matrices.
So we basically rip the internal representation (data, indices, and indptr) out of csc matrices and pass them into a toolkit of numba functions that we have spent time optimizing.
The code that will get called for a multiply is this https://github.com/scipy/scipy/blob/master/scipy/sparse/spar... and https://github.com/scipy/scipy/blob/master/scipy/sparse/spar...
It's important that decisions at the Python level trickle down to the correct choice when it comes to this level.
On a 64-bit architecture one would expect that using 64-bit int arrays for indices and indptr would ensure that. But that's not the way it works. We regularly encountered cases where it would call the code corresponding to int32. I know why, and I have special checks and jump through hoops to prevent this.
That's beside the point; with static types I wouldn't need to do this, the compiler would take care of it.
I appreciate your effort to dig through the logic. You have spent time speaking at length in the comment above but unfortunately said little. Malloc has nothing to do with it.
Your third paragraph is manifestly false. Why do I say so? Because I deal with this every day and have counterexamples.
I didn't mean to ask you to find out. Apologies if I wasted your time. I already know why the type mismatch happens. My point was to demonstrate that a lot of manual wading is needed to ensure that it finally bottoms out in native code called with the correct type.
Can you post a gist or link some other concrete example to show how it can overflow the intp type based on large nnz? Reading the code, it looks like this could not happen.
(Note that the entire second step function wouldn’t have this problem, because it’s accessing indices inside the other arrays, after nnz has already been computed, and is not looping over a variable that would overflow, apart from nnz from the first function, which I pointed out above seems not to overflow unless you’re compiling things in a non-standard way that affects npy_intp).
I don't know what your comments about malloc having nothing to do with it mean, though. That is how numpy arrays get the post-allocation value they report for __len__, such as for indices, indptr and data in csr. So __len__ could not overflow an int type (since the allocation requires the platform address space's int type for the underlying contiguous arrays, and the result is returned as a Python integer).
I have mentioned earlier that it is not about precision but about index space. I don't think it's going to be productive use of my time to continue this thread.
One specific reason you are confused is that you are looking at function calls in isolation, not at the entire chain of calls through the different Python ecosystem libraries. The problem is that the indptr and indices arrays that begin their life as int64 arrays get transformed into int32 arrays in specific code paths.
By stating that malloc is not relevant, I mean it's not relevant in this particular instance. By the time control reaches malloc, the type-mismatch damage has already taken place.
Getting a runtime error is far from the end of the matter, even if in certain cases we do get runtime errors. What static types save the user from is the hunting needed to find out where in the chain of functions we are losing the type invariant we need.
Stopping such bugs is a no-brainer with static types. You claimed at one point upstream that type systems cannot rule out such errors. If you believe that, this discussion is a waste of time. That's one of the lowest forms of errors a type system prevents. Comments like these make me doubt your grasp of these things.
BTW, I'm not sure if you believe large rows and cols imply large nnz. That's not how sparse works.
Given your handle I would have expected you to be familiar with this; it's bread-and-butter stuff in day-to-day ML. On the other hand, if your background is stats, I would expect less of the computational nitty-gritty. Nothing wrong with that; they focus on different but important aspects.
If you really care, I would encourage you to track the flow of code from csr creation in scipy.sparse, using memory-mapped arrays for indptr, indices and nnz, down to the C code that will get invoked on two such objects -- carefully. The key word here is carefully. There is no nonstandard compilation because there is no compilation; it's about dispatch to the correct C function.
You seem to believe that on a 64-bit platform such an indexing error cannot happen. That's patently false, because it has happened many times.
In other words, you are saying your ill-conceived and incompletely considered notions of correctness are more correct than test cases that fail.
This is exactly where a static type system would have helped. That ill-conceived, incomplete understanding would have been replaced by a proof that proper types have been maintained over the possible flows of control. In this case it would have saved me a lot of time tracking cases where int64 drops down to int32.
At this point I will stop engaging in this conversation because it has become an exercise in pointless dogma.
If you refuse to accept that runtime errors, detected or undetected, have a cost, or that static types can mitigate such costs -- whatever floats your boat. What I am claiming is that several times in my Python/Cython use I hit instances where static types would have saved a lot of trouble, time and money.
Another common type-related problem happens when you need to ensure things remain float32 and do not get promoted to float64. I work both in the large and in the small, so I encounter these.
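The float32/float64 promotion mentioned above is easy to trigger; a minimal sketch:

```python
import numpy as np

# Mixing a float32 array with a float64 array silently promotes to float64.
a32 = np.ones(3, dtype=np.float32)
a64 = np.ones(3, dtype=np.float64)
print((a32 + a64).dtype)  # float64

# Staying in float32 requires an explicit cast at each mixing point.
out = a32 + a64.astype(np.float32)
print(out.dtype)  # float32
```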
Right, and if you read my comments, I have also only been talking about the index space, where int64 vs int32 is a matter of integer precision for representing large numbers of indices. But npy_intp will be of the higher precision (to match the platform's address space) and cannot suffer the overflow issue you described unless numpy was custom-compiled to define npy_intp as int32 even on a 64-bit system. (You seem confused about this when you repeatedly say compilation isn't part of it, as if anyone were suggesting your personal workflow involves compiling anything; I'm talking about how the numpy you have installed was compiled.) If it's standard numpy compiled for a 64-bit system, then the evidence suggests your claim is just wrong.
You claim that indptr and indices arrays are silently converted from int64 to int32 on 64-bit platforms but you offer no evidence. You just keep saying that it happened to you, despite the actual code you linked indicating that it couldn’t happen. And I do actually work with indptr and indices arrays with tens of billions of nonzero elements in an Alexa top 300 website search engine every day, and have never encountered any such silent type conversion.
Given that this example of index int precision actually seems unfounded in the code, it just doesn’t seem relevant to any sort of static vs dynamic typing debate. There’s no such issue here that static typing would help with, because it’s clearly not causing the problem you think it’s causing in the dynamic typing code.
There was a recent HN thread on Ousterhout about exactly that: https://news.ycombinator.com/item?id=17779953
Re evidence: you can't possibly expect me to replicate the entire software library stack here on HN or elsewhere to show the loss of type information.
Hint: you are still looking at functions in isolation and repeating that loss of type info cannot happen, while I have dealt with hundreds of counterexamples. You can look up the conversation; I did not say the linked GitHub code is to blame. I pointed at the code to say that we have to make sure that piece of code is eventually called with the right type. We have to ensure that at the time of creation of the sparse matrices. With dynamic type handling this gets lost: int64 gets dropped to int32.
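For what it's worth, scipy.sparse does downcast index arrays whose contents fit in int32 (via its internal index-dtype selection at construction time), so int64 arrays handed in can come back out as int32. A sketch, assuming a stock scipy build:

```python
import numpy as np
import scipy.sparse as sp

# Build a csr_matrix from explicit int64 index arrays.
data = np.array([1.0, 2.0])
indices = np.array([0, 1], dtype=np.int64)
indptr = np.array([0, 1, 2], dtype=np.int64)
m = sp.csr_matrix((data, indices, indptr), shape=(2, 2))

# scipy silently downcasts the index arrays because their values fit int32;
# the int64 dtype chosen at creation does not survive construction.
print(m.indices.dtype, m.indptr.dtype)  # int32 int32
```

So dtype choices made at the Python level are not necessarily the dtypes the C kernels eventually see, which is the whole point being argued here.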
The marketing pitch for that is always something like QuickCheck in Haskell, where e.g. reversing an array should be its own inverse function and you can auto-verify this like it is a law across a bunch of cases.
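That pitch can be sketched even without Haskell or any property-testing library; the checker function here is hand-rolled for illustration:

```python
import random

# Hand-rolled sketch of the QuickCheck idea: state a "law" (reversing twice
# is the identity) and verify it over many randomly generated cases.
def check_reverse_involution(trials=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        assert list(reversed(list(reversed(xs)))) == xs
    return True

print(check_reverse_involution())
```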
The problem is that in real-life unit tests, nothing has laws like this; it's just a bunch of bizarre case-specific business logic and reporting code. The concept of a corner case is a semantic one, and the definition of which inputs are possible for a given function will change, with constraints from the outside world that not even the most expressive statically typed language will easily let you encode into the type system.
Combine that with the fact that your colleagues have variability in their skills too, and often won't make good choices with type-system abstractions to represent business logic, and then all that costly extra boilerplate code for specifying types, creating your own business-logic-specific ADTs, adding privacy modifiers, templated or type-class implementations...
...it just becomes a big pile of garbage liabilities for what turns out to seriously be no benefits over dynamic typing.
Even in the static typing case, you’ll end up with tons of runtime errors causing you to frequently revisit assumptions in the unit tests. You’ll just have a harder time refactoring large pieces of code that are wedded to particular type designs and you’ll have to sit and wait on the compiler to try every change (this can be hugely bad when the system has components needed for rapid prototyping, interactive data analysis, or other real-time uses).
I’ve really seen a lot of corners of this debate play out in practice, and static typing beyond extremely simple native types and structs (basically C style), really offers nothing while being a huge productivity drain. The claims that it actually helps productivity because the compiler catches errors and forces more correctness just turns out to be false in real code bases. You get just as many weird runtime errors and just have a harder time debugging or rapidly experimenting with changes.
I liken it to David Deutsch’s comments on good systems of government in his book The Beginning of Infinity where he advised that the trait you should use to evaluate a system of government is not whether it produces good policies, but rather how easy it is to remove bad policies.
Languages don’t cause people to invent better designs for mapping between the real world and software abstractions. They can provide tools to help, but they don’t cause the design.
But statically typed languages do create boilerplate and sunk cost fallacies leading to living with bad designs and accepting limitations that have to be coded around.
Dynamic typing compares favorably in this regard: it is very easy to rip things out or treat a function as if it implicitly handles multiple dispatch (because you can specialize on runtime types with no overhead and no enforcement on type signatures or type bounds), and quickly get feedback on whether a design will be a good idea, or what things would look like ripping out some bad design.
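A minimal sketch of that runtime-type specialization in Python, using the standard library's `singledispatch` (the `describe` function is a made-up example):

```python
from functools import singledispatch

# One generic function, specialized per runtime type, with no enforced
# signatures or type bounds; adding or ripping out a case is one decorator.
@singledispatch
def describe(x):
    return f"something: {x!r}"

@describe.register
def _(x: int):
    return f"int: {x}"

@describe.register
def _(x: list):
    return f"list of {len(x)} items"

print(describe(3))
print(describe([1, 2]))
print(describe("hi"))
```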
I think exactly what you describe is the hyped up promise that static typing, especially in functional languages, fails to actually deliver in practice. You can still write production code that way, just incurring costs of maintenance of more code & boilerplate without the supposed offsetting benefits of catching more bugs, reducing runtime errors, or communicating design more smoothly in the type system, except in isolated, small parochial cases.
Somewhere along the spectrum of increasing business cost of runtime errors, the needle switches in favor of static types. This is all the more true when you ship applications to folks who don't necessarily know or care about the internals. Throwing runtime errors is just bad form in those cases.
When the code is going to be deployed on infrastructure you control, there is a lot more leeway to absorb runtime errors. The choice depends on how costly the runtime errors are and how costly the fixes are, time being part of the cost.
But in both commercial R&D and academic R&D, some code really needs to live on for future people to benefit from it. It's just not a good use of research time for people to re-write everything from scratch... every time.
Haskell e.g. is very powerful and elegant but time consuming and hard to learn. LISP is easy to learn and powerful but has kind of clunky syntax.
Ruby has quite nice syntax and is quite powerful but also kind of messy. Python is quite clean and easy to use but not as powerful.
Julia I would say has hit a sweet spot between all these languages. It is quick to learn and understand while also allowing you to write clean easy to read code. That may describe python. But with macros and multiple dispatch I would say it is a much more powerful language.
I also find it much nicer than Python to use as a scripting language, as you get way more functionality out of the box.
I am a C++ developer professionally, and I write little Julia scripts to help me with various boilerplate coding in C++, processing assets, etc. Julia is really quick to drop into when you need it.
With Python I always forget which module some functionality is in, the name of a function, etc. Julia has much better naming, and most useful stuff is already included in the automatically loaded Base module.
JuMP is why I learned whatever I did of Julia and that was mostly because it allowed me to express some things better (not necessarily faster or in parallel).
A good example would be this random problem that came out of an interview discussion.
I'm almost sure my python implementation is wrong, but I can't quite prove it - the Julia one is trivial to understand (a dot product + a minimization function).
As far as comparing code complexity of Julia to Python is concerned, I would say that when you use JuMP in Julia, you should use Pyomo/CasADi/PuLP/... in Python. That is not to say that I don't find JuMP to be a more appealing framework overall. It has wide support for all kinds of solvers, and some Julia/JuMP authors even wrote fairly good MIQCP/MICP solvers on top of commercial/open-source MILP/SOCP solvers.
Why do you say that? Perhaps if the knapsack problem is provably non-polynomial, but being able to distinguish those cases is a rather important skill.
For example, the problem you specified can more easily be solved directly: https://repl.it/repls/RotatingObeseBusinesses
With respect to the interview question context, if I had a candidate that implemented a solver, I would be inclined to say they completely overengineered the solution for a less optimal answer. There are a bunch of assumptions that are required for a solver to be optimal and unless the candidate can enumerate all of them and argue why the problem meets those constraints, the solver is not 100% guaranteed to be correct, whereas a direct solution is.
Not to mention, as you noted, edge cases, and the complete cryptic-ness of the code.
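A direct solution along those lines might look like the following; this assumes the interview problem was a standard 0/1 knapsack, which the thread only hints at:

```python
# Hedged sketch: a direct dynamic-programming solution to 0/1 knapsack,
# no solver required. best[c] = best total value achievable at capacity c.
def knapsack(values, weights, capacity):
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Iterate capacities in reverse so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220
```

Unlike a general solver, the assumptions (integer weights, each item used at most once) are explicit in the code itself.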
But I would say Julia is increasingly getting there. Comparable packages are WAY easier to write in Julia than in Stata/Mata, while also being faster, so any gaps will keep disappearing in the next years.
gen x = y if z > 4 // headaches abound
* For reference, Stata uses +Inf as its missing value, so any operation with "greater than" is going to assign missing values to something. And yes, there have been research papers retracted due to this behavior.
I don't think a standardized and idiomatic data-cleaning process has been established yet, which is for the best right now. There is `JuliaDBMeta` for metaprogramming with JuliaDB tables, and the `Queryverse` for working with a wide array of objects.
One way that Julia's metaprogramming shines is in the ability to go into the AST and replace symbols, enabling local scopes that read more clearly than the alternatives. One workflow I'm excited to experiment with is something like this:
@as my_long_dataset d begin # make d = my_long_dataset in this scope
    @with d begin
        t = :x1 + :x2 + :x3 # these symbols are arrays inside this @with scope
        d.new_var = t # assign the variable
    end
end
$ aurman -S julia
(rolls up sleeves)
To put it simply: he is implying that he will check out the book.
From what I understand it's more that the combination of other features in the Julia language (like multiple dispatch) makes classes redundant.
Why? I programmed in Lua, which also made that choice, and I found it rather inconvenient. It makes you constantly second-guess whether you got your indices right, since in most other programming languages indices start from 0.