Julia on Google TPU: Shakespeare RNN (research.google.com)
209 points by KenoFischer 4 months ago | 38 comments

I’ve been using Julia for a project at work and it’s been a pretty fantastic experience.

I’ve only been doing some analysis stuff at the moment, but I’ve got a few machine learning/NLP projects coming up that I’m super excited to use Julia and Flux for!

How did you find the compilation process?

Package compilation or code compilation?

Package compilation was painless, and code compilation isn't really noticeable once it starts running; the benefits you get from native speed are evident. I recently learnt that the IO functions are async as well (powered by libuv!) and the parallelism is easily an order of magnitude nicer to work with than Python's (and it's only going to get better).

Oh, I should have mentioned somewhere. Credit for putting the notebook together goes to Elliot Saba (https://github.com/staticfloat) :).

This is brilliant work. I'm not strong in NNs yet, but I am strong in prerequisites/blockers. This demos:

* Working in a rapid application development (RAD) fashion by operating on vectors using a language like Julia/MATLAB/Octave/Scilab which allows focusing on abstractions instead of implementation details and other distractions.

* Running code optimized automagically on GPU/TPU/etc.

* Sharing work over the web in a standard fashion (Jupyter Notebook on colab.research.google.com)

It's not clear to me where in this process the code is actually run on TPU (maybe someone has a tutorial?) but that doesn't really matter. The specific machine learning algorithm used is also not really that important.

The important part is that this enables amateurs to tinker with machine learning, see results quickly and share their work. Which means that now we'll finally see the accelerated evolution of machine learning.

Any of these blockers alone hindered the evolution of AI for decades, but seeing all three knocked down in one fell swoop is pretty astonishing, at least for me. I favorited it as a watershed moment in the history of AI! Congrats to him.

"Error Could not access the resources needed to display output. This is probably because third-party cookies are not allowed by your browser. SecurityError: The operation is insecure."

You know what doesn't throw errors whenever I just try to look at it? Products like Jupyter notebooks - stuff not made by Google. I believe Colab is using dark patterns to discourage blocking third-party cookies by subtly breaking what should render as a simple non-interactive webpage to passive viewers.

To me, at least, Google and their dark patterns are considered harmful.

Here's what I white-listed on uMatrix:

* raw.githubusercontent.com — it's loading the notebook directly from GitHub, so yeah, this one is fairly essential. Just need one XHR here.

* googleusercontent.com — no cookies, but it does load scripts, css, a frame and an XHR here.

That's it. There are some other domains it hits (like fonts.google.com and gstatic.com), but they're not needed to view the file.

Thanks! But instead, I'll just avoid Google.

That's a whole lot of work for a result that's indistinguishable from a Markov chain generator.

> indistinguishable from a Markov chain generator

A Markov chain generator wouldn't:

* Capitalize the first word of each line

* Make lines of approximately the right length

* Mark text with who is to speak it

While this is just a toy example, it's powerful enough to start showing the ways RNNs can produce text that looks superficially correct.

(Generating Shakespeare is actually one of the examples given in the classic http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

You're mistaking a Markov chain toy for an actual Markov chain generator.

1. Yes it would. It would see capitalized words as high probability for the first word on a line or after a period.

2. Obviously it could, depending on stop condition. Especially if you include line length.

3. If trained on corpus of plays, for sure it would.

The strength of the RNN is supposed to be in context and memory... Perhaps handling of grammar.

There are advanced hierarchical grammars, related to Markov random field models, that are about on par with RNNs on many text and music analysis workloads. (In fact probabilistic math is often used to describe the results and workings of a deep NN anyway.)
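To make the Markov chain side of this argument concrete, here is a minimal sketch of a character-level generator. The function names, the order of 4, and the tiny corpus are all made up for illustration; none of it comes from the notebook. It only shows that formatting that fits inside the context window (speaker labels, capitalized line starts) is reproduced, while anything needing longer memory is out of reach.

```python
import random
from collections import defaultdict


def train_markov(text, order=4):
    """Map each `order`-character context to the characters that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context].append(text[i + order])
    return model


def generate(model, seed, length=200, rng=None):
    """Extend `seed` one character at a time by sampling from the model."""
    rng = rng or random.Random(0)
    order = len(seed)  # the seed doubles as the first context
    out = seed
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:  # context never seen in training: stop early
            break
        out += rng.choice(followers)
    return out


# Tiny made-up corpus in play format (an assumption, not the notebook's data).
corpus = (
    "KING:\nWhat news from the north?\n\n"
    "MESSENGER:\nThe news is good, my lord.\n\n"
) * 3

model = train_markov(corpus, order=4)
sample = generate(model, seed="KING", length=120)
print(sample)
```

Every five-character window of the output occurs somewhere in the corpus, which is both why the formatting looks right and why the model cannot track anything further back than four characters.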

Sure, the point is

  1. To use a small dataset that can be fed from the relatively small Colab VM (so people can play with it themselves)
  2. To use a well known model (so people can focus on the implementation parts of it)

There's a ResNet example in the previous entry in the series (though the dataset is too large to feed from Colab).

Also, this is very early stage software, so things are a bit more verbose than they should be :).

Am I mistaken that this is interesting (and possibly also complex) because it's using the TensorFlow (out of julia) library to leverage the TPUs, while the actual ML algo is implemented in Flux, a library which has no TPU support?

There's very little TensorFlow here. We're using it as a glorified gRPC client basically to ship XLA to the TPU. The actual mechanism is described in https://arxiv.org/abs/1810.09868. The model definition itself uses the layers from Flux, but there's a couple of assumptions that Flux makes that don't hold for the TPU so we don't get to use everything from Flux (e.g. the training loop and the optimizers). Luckily everything is julia, so we can get TPU-compatible versions in just a few lines of code. Unifying the abstractions is an active work in progress, but this shows off that Julia runs on TPUs at all (and on freely available, public infrastructure at that), which is pretty cool, because right now the only other TPU frontend that can say the same is TensorFlow (which had a bit of a head start ;).

Not to detract from Julia, but I’ve always found it odd that Julia users report that “Julia runs on X” when in fact it’s more that one can use X from Julia or generate code for X by rewriting Julia code.

By analogy, and contrast, no one says Python runs on GPUs, just because TensorFlow allows describing models in Python that then get run on the GPU, or Numba rewrites Python loops to CUDA PTX.

It looks like marketing tbh, despite knowing that Julia is a very solid language technically and shouldn’t need these kinds of rhetorical tricks.

So to me the biggest semantic questions are the following:

  - Does what you have to write to run on X follow the semantics of the language?
  - Can you use data structures/code defined in libraries that don't know about your thing?

In the case of julia on TPU here, the answer is yes! (Surprisingly perhaps, and getting this to work is pretty hard.) In particular, you get a lot of julia's language features: multiple dispatch, control flow, etc. It's a bit of a subset of the full language (e.g. no mutation at the moment), but everything that's supported is just standard julia and we're working on growing that subset more and more.

That approach is very different from something like TensorFlow where you're essentially metaprogramming an expression graph. Numba probably counts for python (yeah, you have to put an annotation on things, but if the python people really wanted, they could probably import Numba into core cpython and make it work more smoothly). Of course in python, you have the additional complication that most of the core implementation is not in python itself, so even if you satisfy my two criteria above for the core language, you're still gonna have to rewrite the whole standard library.

It's a nice demo, but reusing Julia libraries not intended for TensorFlow seems like a fragile thing? Just because it works today doesn't mean the authors won't inadvertently break it by using something outside the portable subset.

It seems like for non-demo usage, you would want upstream maintainers to agree that their code should be TensorFlow-compatible, and have tests keeping it working.

You are correct that there is a social aspect to making julia packages work well. That's part of the reason so many julia packages are organized under various GitHub organizations to make sure that these kinds of discussions have a place to take place. However, I don't really see that as a negative thing. It gets package developers to talk to each other and delineate the abstractions for their packages more clearly. And in the end, I don't see it all that different from package development in any other language. Your users will always do things that you didn't intend and then you have to decide whether their use case is in scope for your project or not and act accordingly.

Aside: this is not TensorFlow, but XLA, which are two very different things. It's also possible to try this kind of thing and generate a TF graph, but TF is a much less nice compilation target.

Another thing I should have mentioned here is that Julia's multiple dispatch helps a lot with this problem, since you can provide specializations in the dependent package. So e.g. Flux.jl only needs to have generic code, CPU and GPU code and I can provide TPU specializations (where necessary - hopefully not often), for any function that needs it (yes, talking to upstream is required here too, to make sure they're aware we're doing it, but at least they don't have to maintain it).
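For readers who don't know Julia, a rough Python analogy of "specializations in the dependent package" can be sketched with `functools.singledispatch`. All names here (`matmul`, `dense_layer`, `FakeTPUArray`) are hypothetical and are not Flux's or XLA's API. Note also that `singledispatch` selects on the first argument only, whereas Julia dispatches on all of them, so this is a weaker mechanism than the one described above.

```python
from functools import singledispatch


# --- "upstream" package: generic code only, knows nothing about TPUs ---
@singledispatch
def matmul(a, b):
    """Generic fallback: plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dense_layer(weights, inputs):
    """Generic layer code; it only ever calls `matmul`."""
    return matmul(weights, inputs)


# --- "downstream" package: adds a specialization without touching upstream ---
class FakeTPUArray:
    """Hypothetical stand-in for a device array type."""

    def __init__(self, rows):
        self.rows = rows


@matmul.register(FakeTPUArray)
def _(a, b):
    # A real backend would hand this off to the device; here we only
    # reuse the generic path and wrap the result in the device type.
    return FakeTPUArray(matmul(a.rows, b.rows))


w = [[1, 2], [3, 4]]
x = [[5], [6]]
print(dense_layer(w, x))                       # generic path
out = dense_layer(FakeTPUArray(w), FakeTPUArray(x))
print(type(out).__name__, out.rows)            # specialized path
```

The upstream `dense_layer` is unchanged in both calls; only the registration made downstream decides which `matmul` runs.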

Hmm, I guess that's okay as long as they don't change the code to call any new functions?

The set of functions that some generic code calls could be considered similar to an interface or trait in other languages. If they expand the interface (by calling new functions) then you'd need to make sure the new functions they call have the appropriate implementations needed as well.

In Go, for example, the interface definition would be explicit. You'd add a method to the interface (or perhaps define a new interface) and update all the implementations you know about. If there is any outside code calling it with their own implementation, they'd get a compile error.
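The explicit-interface style described here can be approximated in Python with an abstract base class. This is only a sketch with made-up names (`Backend`, `OldBackend`, `NewBackend`, `fused`), and unlike Go the failure shows up at instantiation time rather than at compile time.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Explicit interface: every operation the generic code may call."""

    @abstractmethod
    def add(self, a, b): ...

    @abstractmethod
    def mul(self, a, b): ...  # imagine this one was added later


class OldBackend(Backend):
    """Implementation written before `mul` joined the interface."""

    def add(self, a, b):
        return a + b


class NewBackend(Backend):
    def add(self, a, b):
        return a + b

    def mul(self, a, b):
        return a * b


def fused(backend, a, b, c):
    """Generic code: computes a * b + c using only the interface."""
    return backend.add(backend.mul(a, b), c)


try:
    OldBackend()          # incomplete implementation is rejected here
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)
print(fused(NewBackend(), 2, 3, 4))
```

With implicit interfaces, as in the generic-function style discussed above, the stale implementation would instead fail only when the new function is first called on it.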

It does sound rather convenient if essentially every function call allows for new implementations, though.

In TensorFlow one doesn't write Python code and run it on a G/TPU, one writes Python code that uses an API to construct a TensorFlow graph computation which is compiled and runs on the G/TPU. It's really a completely separate programming language with different semantics and a totally separate runtime hiding behind a Python API. This is why you can't use any of Python's normal libraries, you have to use TensorFlow libraries.
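The "Python code that constructs a graph" pattern can be illustrated with a toy example. The `Node`, `placeholder`, `constant`, and `run` names below are invented for this sketch and are not TensorFlow's API; the point is only that the Python operators record structure, and a separate evaluator does the actual work.

```python
class Node:
    """A node in a tiny expression graph (hypothetical, not TensorFlow's API)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    # Operators record structure; they do not compute anything.
    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def placeholder(name):
    return Node("placeholder", value=name)


def constant(v):
    return Node("const", value=v)


def run(node, feed):
    """The separate 'runtime': walks the graph and evaluates it."""
    if node.op == "const":
        return node.value
    if node.op == "placeholder":
        return feed[node.value]
    left, right = (run(n, feed) for n in node.inputs)
    if node.op == "add":
        return left + right
    if node.op == "mul":
        return left * right
    raise ValueError(f"unknown op {node.op!r}")


x = placeholder("x")
y = x * constant(3) + constant(1)   # builds a graph; no arithmetic happens here
print(y.op)                          # the Python value is a graph node
print(run(y, {"x": 2.0}))            # evaluation happens in the separate runtime
```

Ordinary Python control flow or library calls applied to `y` see a `Node`, not a number, which is the sense in which normal Python libraries cannot be reused inside the graph.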

"Julia runs on GPUs" actually means that you write Julia code and it is compiled to natively run on GPUs [1]. Similarly, this post is about writing Julia code and compiling it to run natively on TPUs, not calling some predefined TPU library. Yes, of course you _can_ call libraries that are compiled for GPUs and TPUs, as you can in Python or any other language, but in Julia, nearly arbitrary user code can be compiled and run on GPUs and TPUs. The major restriction on GPUs is that you cannot do dynamic allocation, so you need to write non-allocating Julia code, but that's a common requirement for high-performance code anyway.

[1] https://github.com/JuliaGPU/CUDAnative.jl

This has been available for Python going waay back to PyCuda, and also numba has support for jit compiling to a GPU target. You can do similar things with Cython as well.

Except for numba (when used on a specific subset of python), none of these utilities let you use a pure-python function with them. You can not write a generic python algorithm unaware of these tools and expect it to work efficiently with any of them without rewriting.

Totally false. Cython allows you to choose whatever subset of the CPython API you want to use, to suit whatever performance tradeoffs you’d like to make in exchange for access to convenience, dynamic attribute access, whatever.

You can write fully no-python C functions, all the way up to straight CPython dynamically typed code relying on the GIL and reference counting, etc.

I think you are missing the point. How do I get a pure-python function written by someone else to be "C level" fast when used in the inner loop of your otherwise superfast cython? How do you do that without rewriting it in cython?

Sure, cython is great if you decide to rewrite the library you want to use in cython. Same with scipy.optimize for instance: it can be super fast for everything but ordinary python cost functions.

Here is an easy way to prove me wrong: Find the minimum of a pure python function quickly, with whatever python tool you want, without having to deal with the enormous python function call overhead, without rewriting the function. Imagine this function is deep down in another library and it uses a bunch of 3rd party libraries.

I think you have a very fundamental misunderstanding here.

When you ask for a pure Python loop, for example, you are directly saying you want something like the iterator protocol of Python, inclusive of its overhead, because your use case needs the dynamic behaviors, overloading via custom iterators, whatever. You’re directly saying the performance trade-off is worth it for you personally because a low overhead C-style loop won’t give you the extra features of e.g. the iterator protocol that matter more to you.
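As a sketch of what "the iterator protocol, inclusive of its overhead" refers to, the snippet below spells out roughly what a Python `for` statement does. The `Countdown` classes and the two loop functions are made up for illustration.

```python
class Countdown:
    """Custom iterable: changes what `for` does via the iterator protocol."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)


class CountdownIterator:
    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


def loop_with_for(iterable):
    total = 0
    for item in iterable:
        total += item
    return total


def loop_desugared(iterable):
    """Roughly what the `for` statement does behind the scenes."""
    total = 0
    iterator = iter(iterable)          # calls iterable.__iter__()
    while True:
        try:
            item = next(iterator)      # calls iterator.__next__()
        except StopIteration:          # exhaustion is signalled by an exception
            break
        total += item
    return total


print(loop_with_for(Countdown(3)), loop_desugared(Countdown(3)))
print(loop_with_for([1, 2, 3]), loop_desugared([1, 2, 3]))
```

Each element costs a dynamic method call, which is the flexibility being traded against a plain C-style counting loop.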

When you say, but my use case doesn’t need the iterator protocol, then you just write that piece of code in Cython or use a nopython numba jit call, etc., because you want to make your trade-off differently in that single case.

You’re essentially saying, “how do I make a triangle with 5 sides,” by requesting a pure Python section of code to not be pure Python.

On a side note though, you can take pure Python code and compile it directly with Cython, with no modifications, and in many cases it will still be quite a bit faster because it can compile away some types of attribute accesses, overhead of boolean logical checks as function calls, and also reduce function call overhead for many standard library functions or data structures that are already implemented as C extension modules (by bypassing some of the method lookup logic to avoid checks on PyFunction objects in CPython).

Usually this is a recommended first step before ever adding a type annotation or doing anything more difficult.

Finally, just to be clear, numba (and tools using llvmlite more generally) don’t worry about this, since they perform static type inference and have other rules about enforcing static types when jitting a Python code segment.

"Here is an easy way to prove me wrong: Find the minimum of a pure python function quickly, with whatever python tool you want, without having to deal with the enormous python function call overhead, without rewriting the function. Imagine this function is deep down in another library and it uses a bunch of 3rd party libraries."

Have you tried that yet?

That question is bizarre and just points to more serious confusion on the part of the commenter above.

If you’re writing a general purpose function minimizer, you have to make assumptions about the functions you will minimize, for example that they are bounded below or that minimization is restricted to a closed subset of the domain.

If they rely on third party libraries, you have to make even more assumptions, that those third party function calls don’t require (for legit reasons) some Python-specific dynamic typing feature of the language that would imply that jit compiling them is impossible.

If you are willing to make these assumptions, then solving this is trivial: just monkeypatch all the relevant functions with jitted versions of the functions. You could even write a tool to walk the relevant packages with pkgutil and monkeypatch everything. Zero rewrites, and you wouldn’t even have to manually indicate what to monkeypatch.

For example the library gevent does this “monkeypatch everything” pattern to change all functions in the requests package into non-blocking async requests automatically.
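A minimal sketch of the "monkeypatch everything" idea, using only the standard library: `fake_jit` is a hypothetical stand-in for a real JIT decorator and merely tags the function, and `thirdparty` is a throwaway module built on the spot. A fuller tool would presumably walk real packages with `pkgutil.walk_packages` instead.

```python
import functools
import types


def fake_jit(func):
    """Hypothetical stand-in for a real JIT decorator; it only tags the function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    wrapper.jitted = True
    return wrapper


def patch_module(module, decorator):
    """Replace every plain Python function defined in `module` with a wrapped one."""
    patched = []
    for name, obj in list(vars(module).items()):
        if isinstance(obj, types.FunctionType) and obj.__module__ == module.__name__:
            setattr(module, name, decorator(obj))
            patched.append(name)
    return patched


# Build a throwaway "third-party" module so the example is self-contained.
thirdparty = types.ModuleType("thirdparty")
exec(
    "def cost(x):\n"
    "    return (x - 2.0) ** 2\n",
    thirdparty.__dict__,
)

early_reference = thirdparty.cost          # like `from thirdparty import cost`
names = patch_module(thirdparty, fake_jit)

print(names)
print(thirdparty.cost(3.0), getattr(thirdparty.cost, "jitted", False))
print(getattr(early_reference, "jitted", False))   # old reference is untouched
```

The last line shows one practical limit of this approach: any code that bound the function by name before patching keeps calling the original, unpatched version.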

If you are asking instead to preserve pure Python features that these third party libraries are using, like reliance on iterator protocol, descriptor protocol, metaclasses, context managers, Python data model special methods, dynamic attribute lookups, etc., then the question is once again not logically consistent. It’s asking for a triangle with 5 sides.

"these third party libraries are using, like reliance on iterator protocol, descriptor protocol, metaclasses, context managers, Python data model special methods, dynamic attribute lookups, etc.,"

Julia can JIT code that relies on vastly improved versions of all those features (minus some setattr etc).

Can you use them on a custom array type in any case, let alone with scipy? No, and that's the point. In julia, the optim package for example takes in abstract arrays with all those features. Mathematical assumptions are categorically different from unavailable or slow PL semantics.

Julia preserves full general programming language semantics including zero cost abstractions, higher order functions, closures, zero cost differentiation and extending custom types, while still being fast enough for numerical computing (minus exceptions).

It seems to me that Python is sticking with 1D figures, and that's assuming your monkeypatch idea works, while also ignoring the ecosystem cost and the fact that monkeypatching in Julia is part of normal code design (through multimethods) whereas in Python it's a code smell.

I think you are really grasping at straws here.

> “Julia can JIT code that relies on vastly improved versions of all those features (minus some setattr etc).”

This basically summarizes the problem with most Julia upvote party posts like this one on Hacker News. Your comment is totally one-sided, Julia is better at every possible thing, so much that you are noseblind to it and can’t get an outside perspective that no, in fact, Julia’s language features do not have some fully dominating feature by feature parity compared against Python.

Every time it’s just an agonizing dragged out comment thread full of this type of overly one-sided thinking. Usually I just ignore all Julia posts for exactly this reason, and probably should have this time too, but seeing Python jit options and Cython options misrepresented so badly just got the better of me.

You are now using denigrating comments instead of answering simple questions. You are not even explaining what is wrong with the quote you took from the previous comment. The poster never said "Julia is better at every possible thing", but you are totally saying that about Python by pretending it is not a chore and a difficult learned skill to write fast Python numerics.

Yes, we all know that if you program in a very particular way (basically by not using any of the great dynamic or introspective features) you get fast python. How is it not objectively better to have a language that is fast independently of whether you use its dynamic/introspective/metaprogramming features?

"Python is fast as long as I program in this very particular and very constrained way" is a silly way to defend python (which is nonetheless an amazing language).

Julia has a ton of "zero cost abstractions". Python, as great as it is, simply does not.

This is treading a fine line here, since you don’t mention Numba, which applies the same approach as Julia, namely translating the language to LLVM IR for generating PTX.

The same applies to those OpenACC pragmas that can offload a butt ugly Fortran loop to GPU: no one says Fortran is running on the GPU, rather the compiler is doing code gen and RT calls to make user life easy.

It thus smells like marketing rhetoric.

Numba compiling Python to PTX is absolutely "Python running on GPUs" and OpenACC is also "Fortran running on GPUs". If user code written in language X is compiled to target hardware Y then that is "X running on Y". This is fairly standard compiler terminology, not marketing speak. You specifically mentioned TensorFlow, which is NOT Python running on G/TPUs since that code has neither Python's semantics nor its runtime.

I had a similar experience trying to use an LSTM model / TensorFlow based on the Shakespeare RNN example.

This doesn't generate novel, meaningful text content without a template, but I'm unaware of any machine learning model that does this well (OpenAI's included). This model does absorb the 'rules' of the text, both grammar and structure (character names and dialogue, scene introductions, etc.).

In my own project, I made use of this to find spelling and grammatical errors on Esperanto Wikipedia: https://medium.com/@mapmeld/esperanto-nlp-part-3-correcting-...

It depends on the dataset. Char-RNNs do pretty well compared to Markov chains when used to generate smaller texts (which is why I've become less of a fan of the Shakespeare example the original Karpathy post used).

How large is the project? I'd like to clone it to my drive account, but not if the data sets are too huge.

The notebook is about 400kB. It also downloads a 4MB dataset and julia, tensorflow and dependencies need about 400MB, but that gets stored in the ephemeral Colab VM, not on drive if you copy it there.
