
Software design patterns for Machine Learning R&D - smalieslami
http://arkitus.com/PRML/
======
glimcat
This is all very good advice which will save you much stress and many wasted
hours.

I would add:

Document any reference material you use, including the source and why you're
including it. Cache any digital content, either in the project path or using a
management tool like Zotero.

Keep a research log. Minimally, annotate your trials. Coming back even a week
later and trying to figure out which trial was done on what hunch with what
results is extremely time consuming without this information.
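
Even something as crude as appending one JSON line per trial works; a minimal sketch (the file name and field names here are only examples):

    import json
    import time

    # Append one record per trial to a plain-text research log.
    # The log path and fields are just placeholders.
    def log_trial(hunch, params, results, path="research_log.jsonl"):
        record = {
            "time": time.strftime("%Y-%m-%d %H:%M:%S"),
            "hunch": hunch,      # why you ran this trial
            "params": params,    # what you varied
            "results": results,  # what came out
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    log_trial("larger step size should stabilise training",
              {"step_size": 0.01}, {"test_error": 0.137})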

------
ballooney
The use of 'Patterns' in the title is getting some heat but I think it's
because his title 'Patterns for Research in Machine Learning' is a little play
on 'Pattern Recognition and Machine Learning' (usually known as 'PRML' or
'Bishop') which is a text book by Chris Bishop, arguably the bible in the
field.

This is good advice, especially saving intermediate calculations to file,
which can make iteration much faster. I have witnessed a lot of research
students set a job running that will take about an hour, look at the results,
say 'd'oh!', change one line of code in one of their functions and set the
whole monolith running again, needlessly repeating about 55 minutes' worth of
the hour's computations.
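
To make that concrete, a rough sketch of the kind of caching that avoids the repeat, assuming the expensive part can be dumped to a file (all names here are made up):

    import os
    import numpy as np

    def expensive_feature_extraction():
        # stand-in for the ~55-minute computation
        return np.random.rand(10000, 50)

    def load_or_compute_features(cache_path="features.npy"):
        if os.path.exists(cache_path):
            return np.load(cache_path)   # reloads in seconds on later runs
        features = expensive_feature_extraction()
        np.save(cache_path, features)
        return features

    features = load_or_compute_features()
    # ...the five-minute part you are actually iterating on goes here...

The only discipline needed is deleting (or versioning) the cache file when the upstream code actually changes.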

------
bravura
I have a handful of these too. (Like ecesena, I think they're "best
practices", not design patterns.)

It might be useful to put up a wiki so people can discuss. Even something
simple and ugly like c2.

For example, handling hyperparameters is actually a topic in itself.
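
As one small, entirely hypothetical illustration: even just keeping them in a single structure that gets saved next to the results pays off, e.g.

    import json

    # Keep every hyperparameter in one place and store it with the run output,
    # so each result can be traced back to the settings that produced it.
    hyperparams = {
        "learning_rate": 0.01,
        "num_hidden": 100,
        "weight_decay": 1e-4,
        "seed": 0,
    }

    with open("run_001_hyperparams.json", "w") as f:
        json.dump(hyperparams, f, indent=2)

And that is before you get to searching over them.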

------
textminer
Please, please, please spend a day abstracting out commonly written functions
to one place. Your quickly-written prototype code should not be slowed down by
the 100th slightly different implementation of a text tokenizer or kNN
classifier.
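
Even a single shared module goes a long way; a toy sketch (the module name and the deliberately naive tokenizer are just placeholders):

    # mylab/text.py -- the one tokenizer every prototype imports,
    # instead of each script growing its own slightly different copy.
    import re

    def tokenize(text):
        """Lowercase and split on non-word characters."""
        return [t for t in re.split(r"\W+", text.lower()) if t]

Then every experiment does "from mylab.text import tokenize", and a fix or speedup lands everywhere at once.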

~~~
smalieslami
This is very good advice.

It almost always makes sense to include Tom Minka's Lightspeed toolbox
([http://research.microsoft.com/en-us/um/people/minka/software...](http://research.microsoft.com/en-us/um/people/minka/software/lightspeed/))
right from the beginning.

Also perhaps Netlab
([http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/n...](http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/downloads/))
although it is beginning to get rather dated.

~~~
textminer
The hardest part of implementing a high-functioning production machine
learning stack, for me, isn't articulating the idea, prototyping, iterating,
and then the polished refactoring. It's knowing when to go from a
quick-to-write language like Python/MATLAB/Julia to something painful to write
but smooth to run like C++, or into a scalable Elastic MapReduce or Mahout
process (the former of which, sure, is language-agnostic).

You can only spend so much time optimizing memory and CPU time with smart data
chunking, low-dimensional representations or approximations. EC2 time and
space are relatively cheap, but Python on a single machine with the
multiprocessing module can only speed up by a factor of less than [# of
Cores]...
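
For reference, the multiprocessing pattern in question is roughly the following (a toy sketch; the real per-item work replaces the stand-in function), and it tops out around one worker per core:

    from multiprocessing import Pool

    def score(x):
        return x * x  # stand-in for the real per-item computation

    if __name__ == "__main__":
        pool = Pool()                    # defaults to one worker per core
        results = pool.map(score, range(1000))
        pool.close()
        pool.join()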

~~~
beambot
You should check out PiCloud [1] or MrJob [2]. Both of these seek to make
MapReduce dead simple using Python. More importantly, you can do all your
testing on a desktop PC. Then when you need real horsepower, you just tell it
to "spin up on EC2."

_Disclaimer: I have yet to use either, but I've heard good things._

[1] <http://www.picloud.com/>

[2] [http://musicmachinery.com/2011/09/04/how-to-process-a-millio...](http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/)
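
Going by mrjob's documentation (same disclaimer: untested by me), a job is roughly this word-count shape, and the same script is meant to run locally or, with the "-r emr" switch, on EC2:

    # wordcount.py -- the canonical example, adapted from mrjob's docs.
    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

Run it locally with "python wordcount.py input.txt"; supposedly adding "-r emr" is all it takes to push the same job onto EC2.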

------
ecesena
Misleading title: best practices, not design patterns.

------
xaa
Great post! I have also found useful:

\- Record the (Git) revision number of my code for each run (snippet below).

\- Use GNU make to manage the pipeline of downloading, training, evaluating,
etc.
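
Recording the revision is only a couple of lines, assuming the run script lives inside the working copy (the output file name is just an example):

    import subprocess

    # Store the exact code revision alongside each run's output.
    git_rev = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]).strip().decode()

    with open("run_001_revision.txt", "w") as f:
        f.write(git_rev + "\n")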

------
karavelov
This is very sound advice. I have arrived at a very similar architecture in an
LSI environment.

From my experience, here are some advantages of this architecture:

\- stages can be rewritten independently, so you can prototype in a
fast-to-write language (Perl in my case) and later rewrite whole stages or
parts of them in a fast-to-execute language if you need extra performance
(C, C++ here);

\- you can easily integrate third-party software into your workflow - most of
the existing tools in the field work with input and output files;

\- you can reuse already-written stages for different purposes - just pass
them different options for input/output and parameters (see the sketch below).
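
A toy sketch of what one such stage can look like - file in, file out, everything else as command-line options (the script name, options and JSON-lines format are all made up for illustration):

    # filter_stage.py -- one pipeline stage: read an input file, write an
    # output file, take everything else as command-line options.
    import argparse
    import json

    def main():
        parser = argparse.ArgumentParser(description="One stage of the pipeline")
        parser.add_argument("--input", required=True)
        parser.add_argument("--output", required=True)
        parser.add_argument("--threshold", type=float, default=0.5)
        args = parser.parse_args()

        with open(args.input) as f:
            records = [json.loads(line) for line in f]

        kept = [r for r in records if r.get("score", 0.0) >= args.threshold]

        with open(args.output, "w") as f:
            for r in kept:
                f.write(json.dumps(r) + "\n")

    if __name__ == "__main__":
        main()

Because each stage only sees its own input and output paths, you can swap the implementation of any one stage (Perl to C++, say) or drop a third-party tool into the chain without touching the rest.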

------
paulbunn
Was a bit disappointed - this is generally just good advice/best practice for
any sort of programming task, not just machine learning R&D.

It is, however, very good advice taken in the correct context.

