Machine Learning Library for C++ (diku.dk)
76 points by optiminimalist 1607 days ago | 50 comments



Very interesting, but as a daily practitioner I am skeptical.

First, this is a lot of code! As a C++ machine learning programmer, I am impressed, since I know the pain (for an explanation of why, see this comment: https://news.ycombinator.com/item?id=5613797 ).

Second, it contains a version of BLAS and uBLAS, as well as LBFGS and much more, apparently coded from scratch. That seems like a lot for an ML library, and a lot to maintain.

This makes me skeptical about the performance and maintainability of the code, but it would be fairer to try it first.

Still very impressed.


Hi, shark developer here.

First of all, we are glad that our library is being discussed on this board! We are happy for any feedback we can get!

Regarding performance: we try to make the key algorithms as fast as possible, and for the hardest parts we rely not on uBLAS but on optional bindings to ATLAS. Speed was one of the key design criteria, and we hope that we achieved it. Clearly this is no guarantee that every algorithm is fast, but in that case: just file a ticket!

Please bear in mind that Shark is still in beta, and we are heavily developing it (I am currently working on the family of multi-class SVMs). So, for example, parallelism using OpenMP is not fully integrated yet.


You should consider using Eigen for linear algebra; I have personally found it much better, performance-wise, than bindings to ATLAS or other more standard linear algebra solutions. ML algorithms tend to be multistage (think of the update of weights with momentum in a neural network, for example), and the primitives available in ATLAS or a BLAS library are really too low level. Since Eigen generates code for a whole complicated expression, it can blow a standard linear algebra library out of the water for a certain class of problems. For others, the highly tuned vendor BLAS code would obviously win, but I've seen huge speedups by using Eigen; it fits complex ML operations well.
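(To make the fusion point concrete, here is a minimal sketch; the names, sizes and constants are made up, and this is not Shark or Eigen internals. With expression templates, each update statement compiles down to a single fused loop over the data rather than a chain of BLAS-1 calls with temporaries in between.)

    // Sketch only: momentum weight update written as whole-matrix expressions.
    #include <Eigen/Dense>

    int main() {
        const int n = 512;
        Eigen::MatrixXd W        = Eigen::MatrixXd::Random(n, n); // weights
        Eigen::MatrixXd grad     = Eigen::MatrixXd::Random(n, n); // current gradient
        Eigen::MatrixXd velocity = Eigen::MatrixXd::Zero(n, n);   // momentum buffer
        const double lr = 0.01, mu = 0.9;

        // Each statement becomes one fused loop; no intermediate matrices are materialized.
        velocity = mu * velocity - lr * grad;
        W += velocity;
    }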


This is true, even though we would rather switch to Armadillo because of its easier handling and better high-level behaviour.

Right now the linear algebra library we use (uBLAS) has the same behaviour as Eigen for BLAS1-type expressions, so it tries to generate optimal (non-SSE) code. Only for BLAS2 and BLAS3 do we fall back to the ATLAS routines, which have the same performance as Eigen on the interesting problem sizes.

(Small edit:) In the end it is not that important whether the BLAS1-type expressions are fast, as they make up < 1% of run time. The big chunks are the data processing inside the matrix-matrix multiplications of the neural networks and similar entities.


You forget that if you can do the whole weight update as a single-shot operation, the data doesn't have to go through the cache multiple times; at least on the problem sizes I am working on, it isn't FP throughput that kills it but memory bandwidth. Back to the NN example: if you can do the matrix multiply and the application of the delta weights in a single loop iteration, you get much better cache behavior.

Another thing about code generation: I am also using a hacked version of Eigen in a project I'm working on that can compute tanh and the derivative of tanh together, so the NN activations go quite a bit faster, since you can generate vectorized code for the whole calculation that visits each memory location exactly once. While it's true that the weight-update calculation is where most of the time is spent, I saw a 3-4x speedup in the activation code by doing it in a single operation, due to better memory access patterns and fewer loop iterations. Better memory access patterns can also have synergistic effects on other code because there is less cache pollution. When I was fast and loose and introduced a few extra copies of the matrix data, my performance fell off a cliff once the working set no longer fit nicely in the CPU cache; a 10x difference in the particular case I am remembering.
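(A rough illustration of that idea, not the commenter's actual patched Eigen code: compute tanh and its derivative in one traversal, so each input element is read from memory only once.)

    // Sketch: fused tanh activation + derivative, one pass over the data.
    #include <Eigen/Dense>
    #include <cmath>

    void tanh_with_derivative(const Eigen::MatrixXd& z,
                              Eigen::MatrixXd& a,    // activations tanh(z)
                              Eigen::MatrixXd& da) { // derivative 1 - tanh(z)^2
        a.resizeLike(z);
        da.resizeLike(z);
        const double* zp = z.data();
        double* ap  = a.data();
        double* dap = da.data();
        for (Eigen::Index i = 0; i < z.size(); ++i) {
            const double t = std::tanh(zp[i]); // z is read exactly once per element
            ap[i]  = t;
            dap[i] = 1.0 - t * t;
        }
    }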

As always, performance is part art, part science, and perhaps it won't matter as much in the general case, but for my specific implementation and matrix sizes, Eigen has made a measurable difference compared to other solutions.


FYI, another bonus with Eigen: if you are using a matrix-matrix operation, you get multithreading with OpenMP for free. It isn't the most efficient way (you can probably do better threading yourself), but it gets most of the way there for anything that uses GEMM operations.
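(Assuming Eigen is compiled with OpenMP support, e.g. -fopenmp, a plain matrix product like the sketch below already runs the GEMM kernel on multiple threads; the sizes here are arbitrary.)

    // Sketch: Eigen parallelizes its GEMM kernel when built with OpenMP.
    #include <Eigen/Dense>

    int main() {
        Eigen::setNbThreads(4); // optional; otherwise the OpenMP default is used
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(2048, 2048);
        Eigen::MatrixXd B = Eigen::MatrixXd::Random(2048, 2048);
        Eigen::MatrixXd C(2048, 2048);
        C.noalias() = A * B; // matrix-matrix product, multithreaded for free
    }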


Amazing work, thanks for sharing. Any hint on how it behaves on big datasets? Over time I've found that scaling up to real-world (and industry) datasets requires going custom (or distributed, à la MapReduce/Hadoop).


Unfortunately, we don't have much experience with industry-sized datasets, simply because we don't have them. I know that our SVMs are among the fastest in the world; at least we beat LibSVM and LibLinear (and certainly on very big datasets!). But we lack support for Hadoop/MPI, even though we would like to change that in the future.

Right now I would say that the main focus of Shark is research oriented: we want to be fast but also modular, so that we can easily exchange different aspects of the algorithms with our own work. As these goals sometimes clash, it is hard to claim that we are the fastest, simply because for nearly every algorithm there is some way to improve when you know exactly which combination of model, loss function and training algorithm you use. But we are (hopefully) reasonably fast and certainly want to improve.


May I ask how you become a C++ ML programmer? It is not a usual position, but one I would definitely be interested in pursuing.


Sure. I started as a researcher 14 years ago, then drifted to what I thought was a sweet spot at the time: half researcher, half programmer. I say 'sweet spot' because many applications required both the academic and the applied background back then, so for the sake of thrilling applications it was worth 'downgrading' to pure engineering work when needed. Now I believe the game has changed a bit; Coursera and others are infusing the minds of engineers with highly technical knowledge far more rapidly than before.

For example, I am astonished at the number of implementations of deep learning techniques available now (Shark does include some, AFAIK).

In my past experience I had to write many AI algorithms myself, because I could not find any suitable, free and/or open implementations (or other researchers would not share theirs ;) ).


Thanks! That's very inspiring :) I worked in a similar position for a short while a long time ago, but since then I haven't been able to find anything similar. Good to see that people can find those sweet spots! :D

It's very true what you said about all the implementations available now. For most algorithms, though, I tend to implement them myself as a learning experience; maybe the biggest exception is the standard SVM, since it is kind of tricky, but even for that there are some online algorithms that are easy to implement.

Thanks for sharing your experience!


The library supports some very useful algorithms, both supervised and unsupervised ones. But I can't use it for our commercial, non-GPL projects.


Flame Suit On / Rant Mode On

I actually really don't understand why anyone uses GPL for a library. I've been doing open source for a long long time, and love the GPL. I have code in the Linux kernel, and believe free software AND open source software are great solutions to very real problems in software engineering. Having open code just gives people more options, and I firmly believe it will win over time as far as quality is concerned.

I just think providing libraries only to other GPL code is stupid; it just limits the usefulness of the software. The LGPL is great here: you get core changes contributed back to your library from a larger group of people, and everyone wins. Limiting a library to the GPL means a large population cannot use your code, namely those writing applications that can't be licensed under the GPL. Limiting choice is BAD. The whole reason you should be creating and using free software and OSS is to not weld the hood shut. The GPL should be for applications; for libraries it just limits choices. Down with the GPL for libraries!!!

Flame Suit Off / Rant Mode Off


Limiting choice is BAD. The whole reason you should be creating and using free software and OSS is to not weld the hood shut.

But you sound like you want to limit choice for users. The whole point of free software from the GNU perspective is to keep options open for users, and not allow downstream devs to "weld the hood shut" on derivatives by adding more restrictions on what users can do with those derivatives.

However, there are other reasons people choose it as well. One motivation you sometimes encounter is a view that, if someone's code is used in proprietary, commercial software, they'd like to be paid for it. Hence the dual-licensed model used by libraries like Qt and the Stanford Parser: you can use the GPL version if you're willing to GPL your own app, or you can buy a proprietary license if you aren't. Seems reasonably fair: I give you my code free if you reciprocate and do likewise with your own code, or I sell you a license for cash otherwise.


If someone wants to go with the full GPL, I think the AGPL would be a better fit anyway; otherwise people can wrap it in a web service to do all the processing, and could even charge for it without giving any in-house changes back.


The LGPL isn't vastly better than the GPL, as it makes it difficult to link statically and/or release for closed platforms. Its advocates would probably see these as plus points, but I'm not sure they're going to increase uptake.


The LGPL basically means that if I modify the library code I have to open source the changes, but I can build derived products (in the GPL sense) from an unchanged lib as much as I want. This static vs. dynamic linking debate is silly pedantry; enforcing it doesn't contribute anything to GNU's goals or vision. It's really disappointing to see people spend time on this. If you want to pursue justice, go after the GPL violations. Making someone re-link a binary to force compliance is a misplaced effort and essentially a waste of everyone's time.


Requiring the programmers to keep their changes to the library open is one aspect of keeping the software free; requiring that the end user be able to replace the library with a different one is another, and one that the FSF seems keen on maintaining. Personally, I agree that wider-spread usage of the libraries would be better than the current rather doctrinaire stance - but I assume the FSF sees the LGPL's current emphasis on end-user freedom as contributing more to the GNU goals than wider-spread usage of the libraries would.

In the meantime, there's always the MIT licence.


Difficult, but not really a significant part of the challenges for delivering closed source binaries across platforms.

On linux, you have to build distribution specific binaries that match the shared library versions in the package manager.

On Windows, you generally put all of your shared libraries in your application's folder, since there are plenty of bad actors who install DLLs without versions in the filename into system32. It leads to duplication on the system, but it's a generally accepted (if bad) practice.

(Can't speak to shipping LGPL libs on OS X).


I was thinking more of the Xbox 360, PlayStation 3, iOS, etc. - none of them support user-replaceable files. I don't think some of them even support dynamic linking.


Some people believe in software freedom more than you do.


Some people define "freedom" with 10 pages of restrictions of which the exact interpretation is continuously debated years after release. Other people consider that the opposite of freedom.


Some wouldn't mind their open-source software integrated into high-frequency trading systems and military munitions, and others either object on moral terms or wish to be paid fairly for such lucrative use.

Do you think programmers shouldn't have the choice to decide whether their software may be used to kill people or cause the next flash crash? GPLv3 allows programmers to share their software while making sure big corporations and defense contractors steer clear.


If LGPL works for you, you may want to look into http://www.mlpack.org/

I'm not really sure how they compare since I haven't really used either library, but there does seem to be some overlap.


Thanks. It's not as complete as Shark, but it is nonetheless very interesting.


Could you have the ML part done as a standalone process and not actually linked into your other project code? That is, process some data "out of band" so to speak, and write it out to a file, which is then used by "your stuff"? Or communicate with the ML part over a socket? Either approach would let you use this code without requiring the other parts of your project to be licensed under the GPL.

Just a thought...


Yes, I think it is possible to use the library as a standalone process. It would introduce a little overhead; more importantly, the code would be harder to debug and maintain.


This looks awesome. I've been itching to try out some ideas I have after going through Bishop's book, but I've been hesitant to write the algorithms from scratch. Now I'll have to decide between learning Matlab or using a library such as this.


You should definitely use a scriptable ML library. The process is very iterative and not suited to a compiled language like C++. I use scikit-learn a lot, but the Matlab toolboxes or R are also great. At its heart, ML is a lot of stats, so use something built for maths, not C++. It doesn't really make sense to break out C++ until you know exactly what algorithm and settings you need and your application is real-time.


I very much agree with this. I had even been thinking of making a DSL (Scheme-based, probably) oriented to ML instead of a library. I would find that more useful in the exploration phase.


If the OP was thinking of writing his own algorithms, and this is a linkable library with that heavy math already implemented, couldn't he write bindings for Python/Lua/Tcl/Ruby and have everything he needs for scriptability, or am I missing something?


You aren't missing anything, you are absolutely correct, but the question isn't "can they?"; the question is "will they?"


I'm in the Stanford/Coursera machine learning course right now, and something like this is nearly exactly what I've been looking for.

As some others have said, the GPLv3 is off-putting, but there is the LGPL mlpack library (http://www.mlpack.org/), also in C++. Personally, project-wise, the only way this could be improved is if the project were pure C with a BSD, MIT, or similar license. Quite looking forward to checking these out, though.


Honestly, you guys are crazy if you think Shark is going to help you learn machine learning. It's ideal for deploying ML on things like embedded computers, robotics, games, etc., where real-time learning is required. Machine learning requires a lot of experimentation, and C++ is a terrible medium for that. There are loads of good machine learning libraries implemented for Python and Matlab; pretty much every good paper in machine learning is accompanied by an algorithm implemented in Matlab, Python or R. Learn using those reference designs. Once you've figured out what you want, then by all means deploy it on a system in C++ using Shark. I do robotics for a living, and I do go from scripting to C++, but unless it's absolutely necessary I avoid C++. Only for things like vision, which is so CPU-hungry that it has its own computer, do I require C++; every other algorithm stays in Python.


Actually, you forget that performance is critical when you need to train for days at a time; if I used Octave/Matlab/R, my current project might take months to train instead of weeks. All my ML code is high-performance threaded C++. I recommend you use a good template linear algebra library like Eigen; you can do plenty of experimentation in C++. I find that, with a set of a few modern libraries and the required experience, a C++ programmer is just as efficient as a Python/R/Matlab programmer, if not more so. It comes down to the skill of the programmer and the proper choice of libraries.


True, Matlab, Octave and R are all rubbish for performance. I use Python + NumPy, which delegates the hardcore linear algebra to BLAS. I don't normally find C++ gains me all that much. You can also do GPU acceleration pretty easily using Theano (e.g. http://deeplearning.net/software/theano/tutorial/using_gpu.h...)

So I reckon my GPU-accelerated Python still beats a C++ pthreads approach, and is a lot faster to develop on.

Your mileage may vary; from what you said you probably know what you are doing, and maybe GPU is not applicable. I was really replying to the initial comments saying they want to start learning machine learning on a C++ system. Training for days suggests you are doing something hardcore like MCMC/DBNs/Gaussian processes, but learners should not start there...


I'm doing deep belief networks with dropout, and don't have access to GPUs with good double-precision performance. I used to write graphics device drivers, so GPU computing has a special place in my heart, and I definitely agree with you there performance-wise. It is funny, though, that my little laptop is hitting training times similar to some papers where people are using low-end GPUs; it's amazing what you can do when you pay attention to performance.

I suspect my tuned C++ code will work quite well on an Intel MIC, and that is probably where I'm going to go when I have more resources to throw at the problem. I do know that Theano uses Alex's C++ CUDA code under the covers, and I have done a lot of reading of Theano's code, looking at implementation details to help develop mine. I'm just not a big fan of Python (or most scripting languages, actually); perhaps I'm just too old school and have written C, C++, C# and Java for too long. If it doesn't smell or feel like C, I feel like Scotty in Star Trek IV when he was making the transparent aluminum on the Mac.


C++ isn't a terrible medium for experimentation. Games and every "creative coding" interactive projector-art type of project are done in C++.


Speaking as a PhD student in machine learning-

Implement the algorithm yourself, first, in Python+Numpy. The only reason I feel comfortable with Gaussian Processes and SVMs is due to writing code to solve them manually.

Once you're happy with the basics, and can test your ideas with code you intimately understand, optimise for speed by using a library like this.


Implementing the SVM from scratch was time-consuming, no?


The only tricky part would be writing a quadratic solver. Alternatives: either solve a linear SVM using gradient descent (simpler to write), or offload the core of the algorithm to an existing solver like cvxopt.

edit: For an example of using cvxopt, check out http://www.mblondel.org/journal/2010/09/19/support-vector-ma...
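(A minimal sketch of the "linear SVM via gradient descent" route, kept in this thread's C++ for consistency: a subgradient step on the L2-regularized hinge loss with labels in {-1, +1}. The step-size schedule and lambda below are placeholders, not tuned values.)

    // Sketch: linear SVM trained with a Pegasos-style subgradient step.
    #include <Eigen/Dense>

    Eigen::VectorXd train_linear_svm(const Eigen::MatrixXd& X,  // rows = samples
                                     const Eigen::VectorXd& y,  // labels in {-1, +1}
                                     double lambda = 1e-3,
                                     int epochs = 100) {
        Eigen::VectorXd w = Eigen::VectorXd::Zero(X.cols());
        for (int t = 1; t <= epochs; ++t) {
            const double eta = 1.0 / (lambda * t);              // decaying step size
            for (Eigen::Index i = 0; i < X.rows(); ++i) {
                const double margin = y(i) * X.row(i).dot(w);
                w *= (1.0 - eta * lambda);                      // shrink (regularizer gradient)
                if (margin < 1.0)                               // hinge-loss subgradient active
                    w += eta * y(i) * X.row(i).transpose();
            }
        }
        return w;
    }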


Another approach is to implement Platt's SMO:

http://en.wikipedia.org/wiki/Sequential_minimal_optimization


cool - thanks


Yes it was :) but time well spent imo.

On reflection, I guess I might have had more free time to spend on this than a normal person - I did the SVM as a [small] part of my master's project, so if you're time-constrained with a real job and a life, it might be best to disregard me.


If you had the quadratic solver, I would think it would be reasonable to add the rest of the code. If you started adding costs, gammas, etc., I would think it would take a while. I spent hours looking at the source code of libSVM at my last job and never really understood what the hell was going on.


I do agree with you regardless, just curious


Does anyone know if there's a good ML library written in C# or F#?


I don't know of any written in F#...maybe I'll have to write one :)

You might also ask on the fsharp-opensource mailing list, maybe someone has an F# ML library I don't know about:

https://groups.google.com/forum/?fromgroups#!forum/fsharp-op...


LibSVM and LibLinear are nice libraries for training SVMs and linear classifiers. There seem to be two ports of LibLinear for C# (I haven't tried them):

http://www.csie.ntu.edu.tw/~cjlin/libsvm/


how about numl? http://numl.net/

folks on SO also like WEKA run through IKVM (a Java to .NET converter): http://stackoverflow.com/questions/1624060/machine-learning-...



