Yanni – An artificial neural network for Erlang (ikura.co)
203 points by _nato_ on July 14, 2017 | 71 comments

It uses `array` (somewhat mutable Erlang structure) and NOTP (no idea how it makes code "zippy" and the repo [1] does not explain anything... it seems to be bypassing the normal way modules are loaded?).

I am unsure why anyone would use Erlang for number crunching. Training neural nets is basically just multiplying big matrices. I was hoping this project would come up with an interesting approach (how about using SIMD on the binary comprehensions that can use it? now that would be cool) but performance / memory usage does not seem to be looked at here.
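To make the "multiplying big matrices" point concrete, here is a minimal sketch in plain Python (not Erlang, and not taken from the Yanni code) of one dense layer's forward pass. The names `matvec`, `sigmoid` and `dense_forward` are made up for illustration, and a real implementation would hand the inner loops to a BLAS routine rather than Python lists.

```python
import math


def matvec(weights, inputs):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_forward(weights, biases, inputs):
    """One fully connected layer: activation(W @ x + b)."""
    pre_activation = matvec(weights, inputs)
    return [sigmoid(p + b) for p, b in zip(pre_activation, biases)]


if __name__ == "__main__":
    # Hypothetical layer: 3 neurons, 2 inputs each.
    W = [[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]]
    b = [0.0, -1.0, 0.5]
    print(dense_forward(W, b, [1.0, 1.0]))
```

Nearly all of the time goes into the multiply-accumulate loop inside `matvec`, which is exactly the part SIMD or a GPU would speed up.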

It is naive / uneducated to think that "Erlang’s multi-core support" + distributedness will enable many things for you. How does the VM scale on 32, 64 threads? Have you tried making a cluster of 50+ VMs? Unfortunately Erlang Solutions Ltd.'s marketing has hyped many.

I am not against projects like these, I am just looking for reasons behind the choices made.

[1]: https://bitbucket.org/nato/notp/src

> I am unsure why anyone would use Erlang for number crunching.

I've talked to a few people from the financial world who do trading with Erlang. They use it because it makes it easy to take advantage of multiple cores.

And like others pointed out, if there is a need to optimize things, they can quickly build a C module and interface with it but they didn't need it so far.

I wrote a long response to this arguing that I thought Erlang was a terrible choice for neural net training, and ended up coming to the conclusion that if you architect intelligently (i.e. you're not passing data around between Erlang processes with any frequency because that's disgustingly expensive, you're optimizing your process count to your compute architecture, you're doing i/o as infrequently as possible, etc.), Erlang is probably a pretty good choice. I'm not sure if you avoid more foot-guns than you create, but I don't know, I can see it.

At the end of the day, anything that lets you bang away uninterruptedly on a processor (no context switches, no cache shenanigans) seems like a suitable implementation.

And of course you get to write in a fun language that is amazing for other use-cases.

Does OTP 20 help with the object passing inefficiencies?

Not in the general case; it only removes copying of literals (constants known at compile time).

You can however add things to the constants pool using [1] for example :)

[1]: https://github.com/mochi/mochiweb/blob/master/src/mochigloba...

i am curious why traders don't look to labview for some of their work. multi-core support is inherent to the language and fpga programming is just a step away with the same language. crunching away on things in parallel is what it's great at.

Because it is awful.

Unless you are a non-programming researcher that learned labview and needs to work on fpga, you should just use something else.

IMO it's worth learning systemverilog even if you're in that position; labview has so many 'gotchas' and is so gross for anything large, I think it is never the right answer.

It's not the parallel type of multiprocessing which Erlang is good for, it's the concurrent type. The platform is largely based around message passing.

Unfortunately this comes with tradeoffs: relatively high throughput compared to synchronous computation, but at higher latency.

Of course that's fine for many workloads.

I'm waiting for the time we finally realize the obvious best model for pipelined SMP applications: rescheduling the next required process to the core where the data are cache local.

'Work-stealing' schedulers already do this - jobs are scheduled onto the core which created them and presumably touched their data last, unless there is load imbalance in which case other cores take jobs. I don't know about the internals of Erlang but I'd be surprised if it was not already work stealing as it's the usual technique.

As far as I'm aware, most work stealing schedulers still aren't cache-aware. One really naive (but possibly effective) way to do this could be to have a per-core (or per L2, or per NUMA node) work LIFO which would be consulted before looking to other cores for work. When your core/L2/NUMA node schedules a task right before terminating, it is more likely that the next task will be local. This, of course, doesn't work if you're more concerned about jitter or latency under load.
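A rough single-threaded sketch of that per-core LIFO idea, in Python for readability. Everything here (`Worker`, `next_task`, the core names) is hypothetical, and having thieves take the oldest task from the other end of the queue is my assumption borrowed from common work-stealing designs, not something stated above.

```python
from collections import deque


class Worker:
    """Stands in for one core (or one L2 / NUMA domain) with a local queue."""

    def __init__(self, name):
        self.name = name
        self.tasks = deque()  # right end = most recently pushed ("hot") task

    def push(self, task):
        self.tasks.append(task)

    def pop_local(self):
        # LIFO: the newest task's data is the most likely to still be cached.
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # Assumption: thieves take the oldest task, leaving the hot end alone.
        return victim.tasks.popleft() if victim.tasks else None


def next_task(worker, others):
    """Consult the local LIFO first; only then look at other workers."""
    task = worker.pop_local()
    if task is not None:
        return task, worker.name
    for victim in others:
        task = worker.steal_from(victim)
        if task is not None:
            return task, victim.name
    return None, None


if __name__ == "__main__":
    core0, core1 = Worker("core0"), Worker("core1")
    for t in ("t1", "t2", "t3"):
        core0.push(t)
    print(next_task(core0, [core1]))  # local, newest task first
    print(next_task(core1, [core0]))  # idle core steals the oldest task
```

A real scheduler would need lock-free deques and some notion of which workers share a cache; this only shows the lookup order.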

I noticed a paper about a cache-aware work-stealing scheduler which I have not yet read[0].

[0]: https://www.researchgate.net/publication/260358432_Adaptive_...

Frankly I believe that Intel could sell processors now with ten times more cache per core, and the queue for them at $50,000 a socket would be just immense.

I probably underestimate the likely cost by several times and then the cooling would be a great science fiction set properly to 1:12 scale, but I certainly know businesses who have a real desire for a product like that.

Am I missing a showstopper preventing the possibility? I'm not going to be persuaded that it couldn't be done by mere impracticalities. I'm quite prepared to take heatsinks the size of cantaloupes...

The problem with huge caches is actually that access latency grows with the physical distance of the cache lines from the pipelines.

This is why typically you see them adding new cache levels, instead of drastically expanding the size of the cache, especially the lower level caches (L1 and L2).

But have you ever actually tried to implement anything complex in LabVIEW? I have (didn't have a choice due to environment limitations) and the resulting monstrosity was not only slow, but impossible to refactor or maintain.

I ended up rewriting key components in C# just to speed it up and make maintenance bearable.

i have. everything i have done in labview has been on major, complex projects that have all exceeded 1,000 VIs. speed has never been a limitation for me other than UI updates, but that is true in most languages. and in that case it was because of a huge amount of data being streamed to the UI.

Yeah Erlang is slow as hell for number crunching but then when you see this:

(Handbook of Neuroevolution Through Erlang) https://www.springer.com/us/book/9781461444626

You're like, huh. There are people doing it. I don't know how, but yeah. I can't even crunch prime numbers on it without giving up because of how slow it is. I guess there's more to it.

> I am unsure why anyone would use Erlang for number crunching.

Good point, I wouldn't use it in production. But I think it can be a great educational tool to learn about implementation of NNs and overall topology.

In Erlang you can interop with C++ libraries using NIFs, so maybe the author will, down the line, move the heavy matrix operations to a NIF.

Rust would also work well here with added memory safety bonuses (if rust is used well).

Especially nice is the Rustler tooling for building a NIF.

In Erlang you cannot run non-BEAM code for more than ~10ms or your VM will crash. GEMM will be hard to use this way...

You can use the dirty scheduler feature, available since OTP 17, to get around the limit (because it creates unmanaged threads).

See https://github.com/vinoski/bitwise

But honestly, you're better off with just Don't Do That. It's not what Erlang is meant to do. It's a suitable language to coordinate number crunching if that floats your boat, but it is not a suitable language for the actual crunching.

Some people still seem to get very upset when someone proposes that some language is not suitable for some use, but there aren't any languages that are the best for everything. The languages in which I would want to write heavy-duty number crunching code will be nowhere near as good as Erlang at writing high-concurrency, high-reliability server code.

Also, to avoid a second post, contrary to apparently popular belief it is not possible to just make up for slow code by bringing in a lot of CPUs. Number crunching in Erlang is probably a bare minimum of 50x slower than an optimized implementation, it could easily be 500x slower if it can't use SIMD, and could be... well... more-or-less arbitrarily slower if you're trying to use Erlang to do something a GPU ought to be doing. 5-7 orders of magnitude are not out of the question there. You can't make up even the 50x delta very easily by just throwing more processors at the problem, let alone those bigger numbers.
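A back-of-envelope way to see this, using Amdahl's law (my choice of model, not something claimed above): if a fraction `s` of the work is serial, the best speedup on `n` cores is `1 / (s + (1 - s) / n)`. The sketch below solves that for `n`. The function name and the 1% serial fraction are assumptions for illustration only.

```python
import math
from fractions import Fraction


def cores_needed(slowdown, serial_fraction):
    """Cores required to win back a `slowdown`x deficit under Amdahl's law.

    Returns None when no number of cores can reach that speedup, which
    happens once 1/slowdown is at or below the serial fraction.
    """
    s = Fraction(serial_fraction)
    target = Fraction(1, slowdown)  # the required value of s + (1 - s) / n
    if target <= s:
        return None
    return math.ceil((1 - s) / (target - s))


if __name__ == "__main__":
    one_percent = Fraction(1, 100)  # hypothetical serial fraction
    print(cores_needed(50, Fraction(0)))   # perfectly parallel: 50 cores
    print(cores_needed(50, one_percent))   # 99 cores
    print(cores_needed(500, one_percent))  # None: unreachable at any core count
```

So even with only 1% serial work, a 50x deficit costs about a hundred cores just to break even, and a 500x deficit cannot be recovered at all under this model.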

You're not writing the number crunching code in Erlang, in this example. You're using Erlang to coordinate that number crunching via NIFs (C programs, managed and read from by Erlang). Dirty scheduling enables NIFs that run for longer than 10ms.

You're right. Please never implement your number-crunching in Erlang. It will be slow.

> Don't Do That.

That's exactly what dirty schedulers were made for - run longer blocking C code but without having to do the extra thread + queue bits yourself.

> is not possible to just make up for slow code by bringing in a lot of CPUs.

It entirely depends on what you are doing. Number crunching could be a small part amongst lots of protocol parsing, binary matching, sending to different backends and so on. Rarely is it purely a simple executable that runs, multiplies a matrix and exits. In that context it could make sense to start with Erlang and then use a dirty scheduler, a driver or similar for the number crunching.

You can:

* Use dirty schedulers. Those are available since 17 and in 20 are enabled by default

* Use a linked in driver and communicate via ports.

* Use a pool of spawned drivers (not linked in) and send batches of operations to them and get back results.

* Use a C-node. So basically implement part of the dist protocol and the "node" interface in C and Erlang will talk to the new "node" as if it is a regular Erlang node in a cluster.

* Use a NIF but with a queue and thread backend. So run your code in that thread and communicate via a queue. I think Basho's eleveldb (a LevelDB wrapper) does this.

* Use a regular NIF but make sure to yield every so many milliseconds and consume reductions, so that to the scheduler it looks like work is being done.
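The last option (do a slice of work, then yield) is easier to see in a toy. This is a Python generator analogy, not the NIF API; in C, I believe the relevant calls are `enif_consume_timeslice` and `enif_schedule_nif`, but check the erl_nif docs. The names `chunked_sum_of_squares` and `run_round_robin` are made up.

```python
def chunked_sum_of_squares(values, chunk_size):
    """Do at most `chunk_size` items of work, then yield control back."""
    total = 0.0
    for start in range(0, len(values), chunk_size):
        for v in values[start:start + chunk_size]:
            total += v * v
        yield  # hand control back to the scheduler between chunks
    return total  # delivered to the caller via StopIteration.value


def run_round_robin(tasks):
    """Toy scheduler: each pending task gets one chunk per turn."""
    pending = dict(tasks)
    results, order = {}, []
    while pending:
        for name in list(pending):
            try:
                next(pending[name])
                order.append(name)  # record who ran a chunk, and when
            except StopIteration as done:
                results[name] = done.value
                del pending[name]
    return results, order


if __name__ == "__main__":
    results, order = run_round_robin({
        "long": chunked_sum_of_squares([1, 2, 3, 4], 2),
        "short": chunked_sum_of_squares([1, 2], 2),
    })
    print(results, order)
```

The long task never holds the scheduler for more than one chunk, so the short task gets its turn in between. That is the property the yielding NIF approach is after.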

A native-implemented function (NIF) should always return within 1ms. If something takes longer, it should be made a port.


FUD, long running NIFs won't crash your VM, just block scheduler cores and mess up the real-time-y-ness of your application.

Ports are the conventional option for long-running native code, but I don't really know the performance implications.

Ports are more about communicating with and managing the lifecycle of external processes in a structured way, if I understand correctly (I have lots of experience with NIFs and next to none with ports). It's not quite a FFI.

In the context of NIFs, long running could mean 100ms. The NIF API supports threads so you can always pass work to a thread pool and only block the scheduler thread for as long as it takes to acquire the lock on your queue. Or use dirty NIFs which I think are no longer experimental. There's also a new-ish function you can use to "yield" back to Erlang from a NIF but that kinda makes me nervous.

This is a neat idea but it would be great if there was a bit more substance to the post. Do we have any performance benchmarks? Why would I consider it a strong contender? Stating "multi-core support" to me is not necessarily scaling.

I'm in no way an expert, but I work in Erlang in my day job and just glancing at the repo, this solution can't possibly be performant. A) Erlang is slow at math. B) Arrays don't have O(1) access (ETS tables might be able to help with this). C) You can't scale this solution with more Erlang nodes (without some additional work).

I really like Erlang and want to evangelize it but I don't think this is a good way of doing it. I only see this as a neat toy but not a selling point for using Erlang..

As a side note: I noticed the repo has a feature note about adding NIFs for performance bottlenecks (native C code for Erlang to talk to). If you end up writing C code, then what are you gaining from Erlang?

My favorite ML book is still the Handbook of Neuroevolution Through Erlang. A bit pricey but you can borrow my copy if you're in SoMa.

Interesting. Can you provide more details, please? Why do you recommend this book? What sets it apart from other books? What do you like best about it? What is your own background (skills, education)? I'm only asking to make a buying decision, given the price.

In many ways it's a companion to this codebase:


I have been thinking about picking this one up for a while, do you think it would be helpful for learning ML in general, or should I start somewhere else?

It's definitely a very niche book; I would definitely focus on statistics and Bayesian methods before delving into genetic algorithms for evolving neural networks.

With the latest revision I keep on getting this error:

    src/yanni_trainer.erl:78: type array() undefined

It was introduced in revision 20.

    3 files updated, 0 files merged, 1 files removed, 0 files unresolved
    [bherman@archy yanni]$ hg update 19
    [bherman@archy yanni]$ make
    rm -f notp notp.boot
    rm -fr ebin
    mkdir ebin
    erlc -o ebin src/*.erl deps/*/src/*.erl
    src/yanni_lib.erl:77: Warning: random:uniform/0: the 'random' module is deprecated; use the 'rand' module instead
    src/yanni_trainer.erl:112: Warning: random:uniform/0: the 'random' module is deprecated; use the 'rand' module instead
    deps/*/src/*.erl: no such file or directory
    make: *** [Makefile:6: default] Error 1
    [bherman@archy yanni]$ hg update 20
    3 files updated, 0 files merged, 0 files removed, 0 files unresolved
    [bherman@archy yanni]$ make
    rm -f notp notp.boot
    rm -fr ebin
    mkdir ebin
    erlc -o ebin src/*.erl deps/*/src/*.erl
    src/yanni_lib.erl:77: Warning: random:uniform/0: the 'random' module is deprecated; use the 'rand' module instead
    src/yanni_trainer.erl:78: type array() undefined
    src/yanni_trainer.erl:118: Warning: random:uniform/0: the 'random' module is deprecated; use the 'rand' module instead
    make: *** [Makefile:6: default] Error 1
    [bherman@archy yanni]$ make

edit: formatting

I wonder if the author is running an ancient Erlang? The code looks very old school, with the lack of maps, no rebar, deprecated functions, etc.

From Erlang release notes:

The pre-defined types array/0, dict/0, digraph/0, gb_set/0, gb_tree/0, queue/0, set/0, and tid/0 have been deprecated. They will be removed in Erlang/OTP 18.0.

Instead the types array:array/0, dict:dict/0, digraph:graph/0, gb_set:set/0, gb_tree:tree/0, queue:queue/0, sets:set/0, and ets:tid/0 can be used. (Note: it has always been necessary to use ets:tid/0.)

Looking forward to an Elixir wrapper for this.

There isn't any need for that. Hex is becoming the de facto package manager for the Erlang ecosystem. You can easily install this as a dependency and call it from your Elixir code.

Oh sweet, I wasn't aware of this.

Or just call it directly.

They should use this as the photo for the website ;) https://upload.wikimedia.org/wikipedia/en/b/b1/Nightbird_%28...

And don't forget! Make sure to read the intro with this in the background: https://www.youtube.com/watch?v=JfSSQ3Vejao

The first Yanni conference will be "Live at the Acropolis" :)

Isn't Erlang good for concurrency but bad at math performance?

Yes, as many have pointed out. It's ideal for coordinating a cluster doing machine learning, but ANN code itself should not be in Erlang.

> It's ideal for coordinating a cluster doing machine learning, but ANN code itself should not be in Erlang.

It's funny that's what they did with disco (http://discoproject.org/).

Python + Erlang = Hadoop 1.0 like.

I really wish there was a way to do inline optimized code, i.e. a gen_server that transparently wraps another language without having to get into NIFs / external servers. Basically an abstraction that hides all of that cruft and builds optimized gen_servers for doing number crunching or heavy processing. Maybe I'm just being lazy. :)

erlexec (or its elixir wrapper) might be worth a try: https://github.com/saleyn/erlexec

I've used this, it's an awesome project, but it's basically just pipes all the way down. I'm thinking of something closer to a NIF. I think Saleyn's c++ node code might be the closest thing for a lower level language.

I'm not sure how I feel about an ML project using the name of one of the greatest instrumentalists in history.

Yeah, that's more of a web framework thing.

Nice work! At Dernier Cri, we began similar work: https://github.com/derniercri/multilayer-perceptron But we were far less advanced than you!

If you use one Erlang node or whatever per neuron, it's gonna be slow as f*ck.

In Erlang you usually run one "node" per machine, though you can call between them with transparent RPC.

This library uses one "process" per neuron for concurrency. Processes are extremely lightweight and entirely unrelated to system processes or threads.

Running one process per neuron is going to be extremely slow.

Erlang's inter-process messaging is ridiculously optimized. Processes are extremely lightweight; it costs approximately nothing to start and stop them. This is one of the core strengths of Erlang.

Running one process per neuron would actually be a very efficient way to do it.

I'm quite sure that would be a grossly inefficient approach. Sending a message is expensive in Erlang, less so than in other languages, but it's still very large compared to a few math operations. It's a common mistake to use processes to represent objects [1].

The recent article from Discord [2] also mentioned "Sending messages between Erlang processes was not as cheap as we expected, and the reduction cost — Erlang unit of work used for process scheduling — was also quite high. We found that the wall clock time of a single send/2 call could range from 30μs to 70us due to Erlang de-scheduling the calling process."

[1] http://theerlangelist.com/article/spawn_or_not

[2] https://blog.discordapp.com/scaling-elixir-f9b8e1e7c29b

It should still be pretty slow compared to just doing the neural net propagation as matrix operations
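Some rough arithmetic on why, assuming one message per connection in a fully connected network and using the 30 to 70 μs per-send range quoted from the Discord article upthread. That figure was measured in their system under load, so treat this as a pessimistic bound and not a benchmark. The layer sizes and function names are hypothetical.

```python
def messages_per_forward_pass(layer_sizes):
    """One message per connection between consecutive dense layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def send_time_seconds(n_messages, micros_per_send):
    """Total time if every send costs `micros_per_send` and none overlap."""
    return n_messages * micros_per_send / 1_000_000


if __name__ == "__main__":
    layers = [784, 100, 10]  # hypothetical small network
    n = messages_per_forward_pass(layers)  # 784*100 + 100*10 = 79400
    print(n)
    print(send_time_seconds(n, 30), send_time_seconds(n, 70))
```

With those assumptions a single forward pass spends somewhere between about 2.4 and 5.6 seconds on sends alone, where the same 79,400 multiply-adds as a matrix product would take well under a millisecond in optimized native code. Parallel schedulers and cheaper sends would narrow the gap, but not by that many orders of magnitude.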

I wonder if you could make computation run faster by configuring the network to cut up the work into gpu vs non gpu work and have each node efficiently process the work and then have the results reassembled.

Each neuron as an individual process is indeed interesting, but not so much if you're using backprop for the training algorithm, as it doesn't really fit the paradigm.

What is the 1-to-1 mapping between the Erlang concurrency model and neural networks?

Each node in a NN can be represented as an Erlang actor / process. They communicate via messages.
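A toy version of that mapping, in Python with plain queues standing in for Erlang mailboxes. This is a sketch of the idea only, not how Yanni does it; `Neuron`, `run`, the ReLU activation and the tiny network are all made up for illustration.

```python
from collections import deque


class Neuron:
    """A neuron as an actor: private state plus a mailbox of (sender, value)."""

    def __init__(self, name, weights, bias, targets=()):
        self.name = name
        self.weights = weights  # {sender_name: weight}
        self.bias = bias
        self.targets = list(targets)
        self.mailbox = deque()
        self.received = {}

    def step(self, deliver):
        """Handle one message; fire once every expected input has arrived."""
        sender, value = self.mailbox.popleft()
        self.received[sender] = value
        if len(self.received) < len(self.weights):
            return None
        total = self.bias + sum(
            self.weights[s] * v for s, v in self.received.items())
        out = max(0.0, total)  # ReLU, chosen only to keep the arithmetic easy
        self.received = {}
        for target in self.targets:
            deliver(target, (self.name, out))
        return out


def run(neurons, initial_messages):
    """Deliver messages until all mailboxes are empty; return fired outputs."""
    outputs = {}

    def deliver(target, message):
        neurons[target].mailbox.append(message)

    for target, message in initial_messages:
        deliver(target, message)
    progress = True
    while progress:
        progress = False
        for neuron in neurons.values():
            while neuron.mailbox:
                out = neuron.step(deliver)
                if out is not None:
                    outputs[neuron.name] = out
                progress = True
    return outputs


if __name__ == "__main__":
    net = {
        "h1": Neuron("h1", {"x1": 1.0, "x2": 1.0}, 0.0, ["out"]),
        "h2": Neuron("h2", {"x1": 1.0, "x2": -1.0}, 0.0, ["out"]),
        "out": Neuron("out", {"h1": 0.5, "h2": 2.0}, 1.0),
    }
    inputs = [(h, (x, v)) for h in ("h1", "h2")
              for x, v in (("x1", 2.0), ("x2", 1.0))]
    print(run(net, inputs))
```

Each neuron only ever touches its own state and its mailbox, which is the part that lines up with the actor model. The forward pass maps cleanly; whether training does is a separate question.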

and it also plays a sexy clarinet??


It's amazing how minimal prior knowledge and a quick google search totally trumps someone who's spent years working in a field.


Are you serious?

Some people are just deeply unhappy, and it comes out like this.


Please stop.
