
Think Julia: How to Think Like a Computer Scientist - cdsousa
https://benlauwens.github.io/ThinkJulia.jl/latest/book.html
======
ssivark
PSA: The goal of the Julia 1.0 release was to stabilize language constructs
for library authors to build upon. It will take time for them to update their
libraries to be compatible with v1.0 -- so if you want to get a feel for the
language, and things are breaking, stick to v0.7 for the near future.

~~~
beyondCritics
Sadly, not even 0.7 is usable right now; see
[https://docs.julialang.org/en/v1.0.0/](https://docs.julialang.org/en/v1.0.0/).
The only difference between 0.7 and 1.0 is the removal of deprecation
warnings. I am currently using version 0.6.4. I feel it is still a great
piece of software, even if it is a few years old.

------
cuchoi
If I don't care about parallelism or speed, is there a reason to learn Julia?

~~~
cjhanks
As somebody who appreciates Julia, my opinion is - probably not. That is,
unless you want the opportunity to create a killer library that shows how the
language features map well to other domains.

But, I think a _lot_ of people in scientific computing are tired of the
typeless mess that Python/Numpy/Scipy code-bases evolve to be. And for those
people, I think it has a _lot_ of merit.

At the end of the day, the language was designed to fill one major gap. A lot
of time and effort in R&D is spent either architecting sane C++ memory
models or reverse-engineering existing Python code. Alternative safe,
well-performing languages like Java are simply not fast enough -- to exploit
the features of a modern CPU, you need to be native. And as a side note, MATLAB
usually cannot be run in a production environment.

~~~
mlthoughts2018
Having spent years working with numpy and Cython, then switching to Scala for
years as well, I much prefer dynamic typing.

Strong type safety is mostly just a waste of time.

~~~
srean
As a long-time Python/Cython user I can say that I have sorely missed static
types on many occasions, especially for long-running tasks. In fact I would
sometimes use Cython not for performance but as a type checker.

I can describe a recent example. I had to ensure that an integer stayed an
int64 as the logic passed through different Python modules and libraries. It
was absolute hell to track down all the places where things were dropping
down to int32. With static types this would have been a no-brainer. This is
not to say that I do not enjoy Python's dynamic typing where it is appropriate.

Hopefully Python 3 will make things better with optional types. But it's still
not statically typed; the annotations are just a pass through a powerful linter.
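
To make the failure mode concrete, here is a tiny hypothetical sketch (the
module and variable names are made up): nothing in the code says "this must
stay int64", so the only defense is a runtime check you have to remember to
write.

    import numpy as np

    # Hypothetical sketch: nothing marks `offsets` as "must stay int64",
    # so one careless step in some other module silently narrows it.
    offsets = np.arange(5, dtype=np.int64)

    # ... imagine this happening deep inside another library:
    offsets = offsets.astype(np.int32)

    # Back in the caller, the only defense is a runtime check:
    print(offsets.dtype)  # int32 -- the invariant is gone and nothing complained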

~~~
mlthoughts2018
While I certainly concede that dynamic typing has pain points like this,
I just think that on balance they create far fewer problems than the maintenance
burden and inflexibility of type-system enforcement patterns.

That said, I find your particular example with int64 extremely hard to
believe. I assume you’re using numpy or ctypes to get a fixed precision
integer, in which case it should be extremely easy to guarantee no precision
changes, and e.g. almost all operations between np.int64 and np.int32 or a
Python infinite precision int will preserve the most restrictive type (highest
fixed precision) in the operation.

I work in numerical linear algebra and data analytics and have used Python and
Cython for years, often caring about precision issues, and have literally
never encountered a situation where it was hard to verify what happens with
precision.

Unless you’re using some non-numpy custom int64 type that has bizarre lossy
semantics, it is quite hard to trigger loss of precision. And even then, a
solution using numpy precision-maintaining conventions will be better and
easier than some heavy type enforcement.
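
For what it's worth, a rough check of the promotion rules I mean (a minimal
sketch; exact scalar behavior differs a bit across numpy versions, e.g. before
and after NEP 50):

    import numpy as np

    a32 = np.array([1, 2, 3], dtype=np.int32)
    a64 = np.array([1, 2, 3], dtype=np.int64)

    print((a32 + a64).dtype)   # int64: mixed-width integer ops widen
    print((a32 + 1).dtype)     # int32: a small Python int adapts to the
                               # array's fixed-width dtype
    print((a32 * a32).dtype)   # int32: same-width ops keep their width -- and
                               # can wrap around silently on overflow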

~~~
srean
I will agree about the 'on balance' part in the context of prototyping speed
and interactive sessions.

When the rubber is about to hit the road, i.e. near deployment with money at
stake, I would have loved an option to freeze the types, at least in many
places. Cython comes in handy, but it's clunky, and its syntax and semantics are
not obvious to a beginner (I am no longer one, but I remember my days of
confusion about cimporting standard headers and Python headers, how to use
Python arrays (not numpy arrays), etc.).

I am curious, have you put money at stake supported only by dynamic types?

Regarding int32 vs int64, it's not a precision issue; it's about sparse matrices
with more than 1<<31 nonzeros. I am equally surprised that you have not run
into this given your practical experience with matrices.

My case involves more than just numpy. There's hdf5, scipy.sparse, some
memory-mapped arrays and of course numpy.

Given the amount of time I spent to debug this, I would have killed for static
type checks.

~~~
mlthoughts2018
I happen to use scipy sparse csc and csr matrices for huge sparse tf-idf data
at work, but have never encountered this (we have a numba utility function for
operations we do directly on the data, indices, and indptr internal arrays,
including counting).

But I do see that counting nnz boils down to a call to np.count_nonzero, which
treats bools as np.intp, which is either going to be int32 or int64 (very
weird that it chooses signed types), then calls np.sum.

The best solution would be to use np.seterr to warn exactly at call sites with
int32 overflow, but amazingly, there seems to be an open numpy issue saying
that seterr is not guaranteed for sum.

I do think seterr + logging would be better for this than roping in static
typing everywhere just to get a one-off benefit like this.
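
As a rough, hypothetical sketch of the seterr idea (behavior varies by numpy
version, and as noted, reductions are the weak spot):

    import numpy as np

    # Ask numpy to warn on overflow, then see where it actually fires.
    np.seterr(over="warn")

    x = np.int32(2**30)
    y = x * np.int32(4)       # scalar integer overflow: usually emits a
                              # RuntimeWarning here

    big = np.full(8, 2**30, dtype=np.int32)
    print(big.sum(dtype=np.int32))   # a reduction can wrap around silently,
                                     # which is the open-issue case above, so
                                     # seterr + logging still has a blind spot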

~~~
srean
But that's just numpy. As I mentioned, the logic flows through other components
too. I am guessing your nnz is medium-sized and hasn't hit 2 billion yet.

Quick question: when you create a scipy CSR matrix, how do you _ensure_ the
subsequent multiplication operator falls back to C code that uses int64 to
index the internals and not int32? I thought that making the indices array an
int64 array would do the job. I was wrong. Anyway, even if that had worked it
would still have fallen short of _ensuring_. If it worked, it just happened to
work -- that's an anecdote.

If one had static type checks one would not have to read through all the layers
to be sure; a compile error, if any, would have told me.
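
A minimal sketch of the kind of silent downcast I mean (details depend on the
scipy version; scipy picks the index dtype itself, whatever you pass in):

    import numpy as np
    import scipy.sparse as sp

    # Hand the CSR constructor explicit int64 index arrays...
    data = np.ones(3)
    indices = np.array([0, 1, 2], dtype=np.int64)
    indptr = np.array([0, 1, 2, 3], dtype=np.int64)
    a = sp.csr_matrix((data, indices, indptr), shape=(3, 3))

    print(a.indices.dtype)   # ...and typically get int32 back: scipy chooses
                             # the smallest dtype that fits the current contents

    b = a @ a
    print(b.indices.dtype)   # the product's index dtype is chosen the same way,
                             # so the int32 C kernel may be the one that runs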

We also can't use scipy.sparse directly, because we don't have that much RAM on
these machines. We do use scipy.sparse, but the matrices operate internally on
memory-mapped arrays. Now, depending on the platform, memory-mapped arrays can
be limited to an index of 1<<31. So we have to be extra careful about what type
is used for indexing in the native libraries that these layers are wrappers
over.

BTW, it's far from a one-off benefit. This was just one of the examples fresh in
my memory. It directly affects real money. There you don't want to ship code
with bugs that can cost you. Static types help rule out these cases once and
for all. With run-time checks it is very hard to be sure that you have caught
all of the code paths that can have these mismatches.

I agree that in grad school it's different :) One can play fast and loose --
even more so if the research is not expected to be reproducible; that would be
pure science.

~~~
mlthoughts2018
Our nnz is certainly far greater than 2 billion. The matrix size is around 150
million rows by around 1.7 million columns. We just accumulate the count with
a python integer.

I don’t know what you mean by “that’s just numpy” though — since even if this
flows through other systems, tracking it at the source in numpy would be
obvious.

“Static types help rule out these cases..” — I just disagree. That is what’s
advertised, but it’s just not true. Years of working in Scala for very heavy
enterprise production systems has made me realize it’s a very false promise.
There are actually remarkably few classes of these errors that are removed by
static type enforcement, and perfectly good patterns to deal with it in
dynamic type situations.

If static typing was free, then sure, why not. But instead it’s hugely costly
and kills a lot of productivity, rather than the promise that it improves
productivity over time by accumulating compounding type safety benefits.

I think a good rule of thumb is that anything that causes you to need to write
more code will be worse in the long run. There’s no guarantee you’ll actually
face fewer future bugs with static typing and visibility noise in the code,
but you can guarantee it adds more to your maintenance costs, compile times,
and complexity of refactoring.

I guess Python’s gradual typing is a good compromise, since you don’t have to
choose between zero type safety and speculative all-in type safety, where the
maintenance overhead almost always outweighs the benefits (rendering it a huge
and irreconcilable form of premature optimization).

You can only add it in those few, rare places where there is demonstrated
evidence that the static typing optimization actually has a payoff.
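
As a hypothetical sketch of that gradual, targeted use (the names are made up;
mypy only flags this when you run it, nothing is enforced at runtime):

    from typing import NewType

    # Encode the "this must stay a 64-bit index" invariant as a distinct
    # type in the one place it matters, and let mypy track it.
    Index64 = NewType("Index64", int)

    def write_offsets(start: Index64) -> None:
        ...  # hand off to a native layer that expects 64-bit indexing

    write_offsets(Index64(1 << 40))   # fine
    write_offsets(12345)              # mypy flags this call; at runtime nothing
                                      # stops it -- it is only a linter pass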

~~~
srean
> since even if this flows through other systems, tracking it at the source in
> numpy would be obvious.

You can't possibly be saying that! Even if one assumes that the source is numpy.

Regarding the rest, let's say my experience with OCaml has been more gratifying
than yours with Scala.

> We just accumulate the count with a python integer.

That won't help when you are using scipy.sparse for sparse-on-sparse
multiplication, because the multiplications fall back to C code. You have to
ensure that it falls back to C code that uses int64 for indexing the arrays. I
am sure you are not saying that you do sparse multiplications of this size in
pure Python.

Our differences in taste aside, you seem to work on interesting stuff. I would
love to exchange notes in case we run into each other one day. Should be fun.

~~~
mlthoughts2018
> “You have to ensure that it falls back to C code that uses int64 for
> indexing the arrays. I am sure you are not saying that you do sparse
> multiplications of this size in pure Python.”

For csc and csr matrices at least, these operations typically iterate over the
underlying indices, indptr and data arrays, and csc `nonzero` uses
len(indices). That length ultimately comes from the C-level call to malloc
that defined `indices` (so it uses the system's address-space precision and
would never report the number of elements in a lower-precision int than what
the platform supports for memory addressing), and it is returned as an
infinite-precision Python int. Afterwards it only uses arrays of indices, not
integers holding sizes.

Long story short: at least for csc matrices, the issue you describe
wouldn’t be possible internally to scipy’s C operations, as you’d always be
dealing with an integer type large enough for any possible contiguous array
length that can be requested on that platform (and the nonzero items are
stored in contiguous arrays under the hood).

On my team we are not doing pure Python ops on the sparse matrices, rather we
needed customized weighted operations (for a sparse search engine
representation that weights bigrams, trigrams, trending elements, etc., in
customized ways) and some set operations to filter rows out of sparse
matrices.

So we basically rip the internal representation (data, indices, and indptr)
out of csc matrices and pass them into a toolkit of numba functions that we
have spent time optimizing.
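
A hypothetical sketch of that pattern, shown for CSR (CSC is analogous; the
weighting function here is made up, not our actual code):

    import numpy as np
    import numba
    import scipy.sparse as sp

    # Operate directly on the raw CSR arrays instead of going through scipy ops.
    @numba.njit
    def row_weighted_sums(data, indices, indptr, col_weights):
        out = np.zeros(len(indptr) - 1)
        for row in range(len(indptr) - 1):
            for k in range(indptr[row], indptr[row + 1]):
                out[row] += data[k] * col_weights[indices[k]]
        return out

    m = sp.random(1000, 500, density=0.01, format="csr")
    w = np.random.rand(500)
    print(row_weighted_sums(m.data, m.indices, m.indptr, w)[:5])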

~~~
srean
Let's not weasel with 'typically'.

The code that will get called for a multiply is this
[https://github.com/scipy/scipy/blob/master/scipy/sparse/spar...](https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h#L562)
and
[https://github.com/scipy/scipy/blob/master/scipy/sparse/spar...](https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h#L609)

It's important that decisions at the Python level trickle down to the correct
choice by the time execution reaches this level.

On a 64-bit architecture one would expect that using 64-bit int arrays for
indices and indptr would ensure that. But that's not the way it works. We
regularly encountered cases where it would call the code corresponding to
int32. I know why, and I have special checks and jump through hoops to prevent
this.

_That's beside the point; with static types I wouldn't need to do this, the
compiler would take care of it._

I appreciate your effort to dig through the logic. You have spent time
speaking at length in the comment above but unfortunately said little. Malloc
has nothing to do with it. Your third paragraph is manifestly false. Why do I
say so? Because I deal with this every day and have counterexamples.

I didn't mean to ask you to find out. Apologies if I wasted your time. I
already know why the type mismatch happens. My point was to demonstrate that a
lot of manual wading is needed to ensure that it finally bottoms out by
calling native code with the correct type.

~~~
mlthoughts2018
The code you linked actually seems to refute your claim of this precision
error, at least for multiply, because it is using npy_intp for nnz, which will
be int64 on a 64 bit platform, and there is even an overflow check below!

Can you post a gist or link some other concrete example to show how it can
overflow the intp type based on large nnz? Reading the code, it looks like
this could not happen.

(Note that the entire second step function wouldn’t have this problem, because
it’s accessing indices inside the other arrays, after nnz has already been
computed, and is not looping over a variable that would overflow, apart from
nnz from the first function, which I pointed out above seems not to overflow
unless you’re compiling things in a non-standard way that affects npy_intp).

I don’t understand your comments about malloc having nothing to do with it,
though. That is how numpy arrays get the value they report for __len__ after
allocation, such as for indices, indptr and data in csr. So __len__ could not
overflow an int type (since allocating the underlying contiguous arrays requires
the platform address space’s int type, and __len__ returns a Python integer).

~~~
srean
Can't help but say this: you are seriously confused. Not necessarily your
fault, as obviously you don't have the full code.

I have mentioned earlier that it is not about precision but about index space.
I don't think it's going to be a productive use of my time to continue this
thread.

One specific reason you are getting confused is that you are looking at
function calls in isolation, not at the entire chain of calls through the
different Python ecosystem libraries. The problem is that indptr and indices
arrays that begin their life as int64 arrays get transformed into int32 arrays
in specific code paths.

By stating that malloc is not relevant I mean it's not relevant in this
particular instance. By the time control reaches malloc, the type mismatch
damage has already taken place.

Getting a runtime error is far from the end of the matter, even if in certain
cases we do get runtime errors. What static types save the user from is the
hunting needed to find out where in the chain of functions we are losing the
type invariant we need.

Stopping such bugs is a no-brainer with static types. You claimed at one point
upthread [0] that type systems cannot rule out such errors. If you believe
that, this discussion is a waste of time. That's one of the most basic classes
of errors a type system prevents. Comments like these make me doubt your grasp
of these things.

BTW, I am not sure whether you believe that large row and column counts imply a
large nnz [1]. That's not how sparse works.

Given your handle I would have expected you to be familiar with this; it is
bread-and-butter stuff in day-to-day ML. On the other hand, if your background
is stats, I would expect less of the computational nitty-gritty. Nothing wrong
with that; they focus on different but important aspects.

If you really care, I would encourage you to track the flow of code from CSR
creation in scipy.sparse using memory-mapped arrays of indptr, indices and
nnz, down to the C code that will get invoked on two such objects, carefully.
The key word here is _carefully_. There is no nonstandard compilation because
there is no compilation. It's about dispatch to the correct C function.

You seem to believe that on a 64-bit platform such an indexing error will not
happen. That's patently false, because it has happened many times.

In other words, you are saying that your ill-conceived and incompletely
considered notion of correctness is more correct than test cases that fail.

This is exactly where a static type system would have helped. That ill-conceived,
incomplete understanding would have been replaced by a proof that the proper
types have been maintained over the possible flows of control. In this case it
would have saved me a lot of time tracking cases where int64 drops down to
int32.

At this point I will stop engaging in this conversation, because it has become
an exercise in pointless dogma.

If you refuse to accept that runtime errors, detected or undetected, have a
cost, or that static types can mitigate such costs -- whatever floats your
boat. What I am claiming is that several times in my Python/Cython use I hit
instances where static types would have saved a lot of trouble, time, and
money.

Another common type-related problem happens when you need to ensure things
remain float32 and do not get promoted to float64. I work both in the large
and in the small, so I encounter these.
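
A tiny sketch of that float32 upcreep (minimal example; one float64 operand is
enough to promote the whole result):

    import numpy as np

    x = np.ones(4, dtype=np.float32)
    w = np.array([0.5, 0.5, 0.5, 0.5])        # default dtype: float64

    print((x * w).dtype)                      # float64 -- the float32 is gone
    print((x * w.astype(np.float32)).dtype)   # float32 -- the fix is manual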

[0]
[https://news.ycombinator.com/item?id=17789837](https://news.ycombinator.com/item?id=17789837)

[1]
[https://news.ycombinator.com/item?id=17789837](https://news.ycombinator.com/item?id=17789837)

~~~
mlthoughts2018
> “I have mentioned earlier that it is not about precision but about index
> space.”

Right, and if you read my comments, you’ll see I have only been talking about
the index space as well. There, int64 vs int32 is a matter of int precision for
representing large numbers of indices, but npy_intp will be of the higher
precision (to match the platform’s address space) and cannot suffer the
overflow issue you described unless your numpy was custom-compiled with
npy_intp defined as int32 even on a 64-bit system. (You seem confused about
this, repeatedly saying compilation isn’t a part of it, as if anyone were
suggesting your personal workflow involves compiling anything, when I’m talking
about how the numpy you have installed was compiled. If it’s standard numpy
compiled for a 64-bit system, then the evidence suggests your claim is just
wrong.)

You claim that indptr and indices arrays are silently converted from int64 to
int32 on 64-bit platforms but you offer no evidence. You just keep saying that
it happened to you, despite the actual code you linked indicating that it
couldn’t happen. And I do actually work with indptr and indices arrays with
tens of billions of nonzero elements in an Alexa top 300 website search engine
every day, and have never encountered any such silent type conversion.

Given that this example of index int precision actually seems unfounded in the
code, it just doesn’t seem relevant to any sort of static vs dynamic typing
debate. There’s no such issue here that static typing would help with, because
it’s clearly not causing the problem you think it’s causing in the dynamic
typing code.

~~~
srean
All I can say is that you will learn a little bit more if your "know it all"
persona is kept in proportion to your actual knowledge :)

There was a recent thread on HN on Ousterhout on exactly that
[https://news.ycombinator.com/item?id=17779953](https://news.ycombinator.com/item?id=17779953)

Re evidence, you can't possibly expect me to replicate the entire software
library stack here on HN, or elsewhere, to show the loss of type information.

Hint: you are still looking at functions in isolation and repeating that the
loss of type info cannot happen, while I have dealt with hundreds of
counterexamples. You can look up the conversation; I did not say the GitHub
code I pointed to is to blame. I pointed to that code to say that we have to
make sure that piece of code is eventually called with the right type, and we
have to ensure that at the time the sparse matrices are created. With dynamic
type handling this gets lost: int64 gets dropped to int32.

------
gerdesj
You might contrast the approach here with, say, an Engineering textbook. This
manual on a particular tool (Julia) seems to imply that it is the one way to
engage with an entire discipline. An Engineering textbook might mention
various tools for a particular job, and even endorse one over the others, but
in general it will start with the problem and not the solution.

That said:

$ aurman -S julia

(rolls up sleeves)

~~~
3rdAccount
Can you explain that last part?

~~~
nur0n
`aurman` is an (unofficial?) package manager for Arch Linux. The standard
package manager is called `pacman`. The Arch User Repository (AUR) is (IIRC) a
repository of uncurated packages compatible with Arch.

To put it simply: he is implying that he will check out the book.

~~~
gerdesj
Yes, I should have put pacman. The book is a great resource and, despite my
criticism, it has introduced me to Julia.

------
inamberclad
Never wrapped my head around Julia. I like it, and I've used it for a couple
things, but I've never had a use case compelling enough to keep at it.

~~~
sgt101
It's elegant and powerful; there are very few coding constructs that are
widely used that aren't in Julia, and those that are (like Classes) aren't
there because the authors of the language don't think that they are useful, as
opposed to "it's hard to implement". But YMMV, the downside is that the
ecosystem is evolving, and it's just hit 1.0 so expect things to be smooth in
6mths to a year. The upside is that I find that the Julia code I write appears
from the keyboard easily, quickly and in a form that I can understand a few
weeks or a few months later.

~~~
vanderZwan
> _those that aren't in the language (like Classes) aren't there because the
> authors of the language don't think that they are useful_

From what I understand it's more that the combination of other features in the
Julia language (like multiple dispatch) makes classes redundant.

------
alexeiz
> In Julia indices start from 1.

Why? I have programmed in Lua, which made the same choice, and I find it rather
inconvenient. It makes you constantly second-guess whether you got your indices
right, since in most other programming languages indices start from 0.

------
pwaai
I'm trying to figure out how to apply this to coding challenges.

