

What do Data Scientists use to train models fast? - moridin007

I'm training a machine learning model using SVM in Python and it took ages on my local machine (with only 10% of the data that I have).
I'm getting an 80-90% correct prediction score on the same subject's data, so now I want to add in the rest of the data (11 more subjects).

I thought of offloading it to my EC2 instance, but I'm on a budget so I can't just take a 30-CPU instance.
On top of everything, the code only ever uses 1 CPU at 100%, so I'm not sure how effective that would be.

What do you guys use to train these models?
======
syllogism
Speed comes from two things: implementation and algorithm. Algorithmically,
the way to learn quickly is to use some sort of stochastic gradient method,
i.e. learn from examples one by one, rather than as a batch.
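To make the stochastic idea concrete, here's a minimal sketch (my own illustration, not anyone's production code) of hinge-loss SGD for a linear SVM in numpy: the weights are updated after every single example, instead of after a full pass over the data. The toy dataset and hyperparameters are made up for the example.

```python
import numpy as np

def sgd_linear_svm(X, y, lr=0.01, lam=0.01, epochs=20, seed=0):
    """Train a linear SVM with hinge loss, one example at a time (SGD)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):       # visit examples in random order
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                 # hinge loss active: update w and b
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                          # otherwise only regularization shrinks w
                w = (1 - lr * lam) * w
    return w, b

# Toy linearly separable data: the class is the sign of the first feature.
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = sgd_linear_svm(X, y)
preds = np.sign(X @ w + b)
```

Each update touches one row, so you never need the whole dataset's gradient in one go — that's what makes this scale to data that would choke a batch solver.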

As far as implementation goes, you need dense arrays. A native Python
implementation will usually be lists of Python objects, which is very slow.
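For a sense of what "dense arrays" means in practice (a sketch with made-up numbers): pack the list-of-lists into one contiguous numpy array, and the per-row arithmetic runs in C rather than looping over boxed Python floats.

```python
import numpy as np

# A "native Python" feature matrix: a list of lists of Python floats.
rows = [[0.5, 1.2, -0.3], [2.0, 0.1, 0.7]]

# Packing it into one contiguous float64 array lets numpy do the
# arithmetic in compiled code instead of a Python-level loop.
X = np.asarray(rows, dtype=np.float64)
w = np.array([1.0, 0.5, 2.0])

scores = X @ w   # one vectorized dot product per row
```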

If you just need an SVM implementation, libsvm is pretty good. I'm assuming
you need a non-linear kernel. If you're using a linear kernel then there's not
really a difference between SVM and MaxEnt (well, there is but not much).

If your data is very sparse then there aren't many general-purpose
implementations that are any good. The scipy.sparse module has some key stuff
implemented in pure Python, and doesn't interoperate properly with the rest of
the PyData ecosystem. I had to implement my own sparse data structures, in
Cython.
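As a point of reference for the sparse case (a sketch, not the custom Cython structures mentioned above): scipy's CSR format stores only the nonzeros, and matrix-vector products skip the zeros entirely. The tiny matrix here is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero (bag-of-words style) matrix, stored in CSR form:
# only the 3 nonzero entries are kept, not all 9 cells.
dense = np.array([[0.0, 0.0, 3.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0]])
X = csr_matrix(dense)

w = np.array([1.0, 1.0, 1.0])
scores = X @ w   # sparse matrix-vector product, skips the zeros
```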

------
facorreia
One approach is to convert the code to use parallelism. For an example of how
to do it in Python using joblib see this article:
[http://blog.dominodatalab.com/simple-parallelization/](http://blog.dominodatalab.com/simple-parallelization/)

Even if you can't afford a 32-core instance, you might get to use 4 cores on
your laptop.
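The joblib pattern from the linked article boils down to something like this sketch (the squared-root workload is a stand-in for whatever per-item work you have; whether an SVM fit splits up this cleanly is a separate question):

```python
from math import sqrt
from joblib import Parallel, delayed

# Each call is independent of the others, so joblib can hand them
# out to worker processes and collect the results in order.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(8))
```

This only helps when the work decomposes into independent chunks; a single monolithic training loop gains nothing from it.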

~~~
syllogism
The problem's pretty obviously the Python implementation... Throwing more
cores at it isn't really going to help.

Is it even easy to parallelise SVM training?

~~~
moridin007
So the solution would be to move from Python to Go? Like, Python isn't that
good for ML algorithms?

~~~
syllogism
Python's fine if the heavy lifting is in a C extension. I use Cython for this;
others prefer just numpy, maybe something like numba. But you can't just have
a Python list of floats.

Java, C++, Scala and Julia are all popular choices. Go is probably fine too,
although I know less about it.

~~~
Lofkin
Actually, list support in numba is still WIP.

------
rajacombinator
How much data and how long are you talking about? If it fits in memory, then
the slowness is likely due to other coding errors causing a bottleneck, not
the SVM training. (Unless you wrote that as well.)

