I would add:
Document any reference material you use, including the source and why you're including it. Cache any digital content, either in the project path or using a reference management tool like Zotero.
Keep a research log. Minimally, annotate your trials. Coming back even a week later and trying to figure out which trial was done on what hunch, and with what results, is extremely time-consuming without this information.
This is good advice, especially saving intermediate calculations to file, which can make iteration much faster. I have watched a lot of research students start a job that takes about an hour, look at the results, say 'd'oh!', change one line of code in one of their functions, and set the whole monolith running again, needlessly repeating about 55 minutes' worth of the hour's computation.
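As a rough illustration of saving intermediate calculations to file, here is a minimal sketch in Python. The helper name `cached`, the stand-in function `expensive_features`, and the file name `features.pkl` are all made up for the example; a real project would likely also want some way to invalidate the cache when the upstream code changes.

```python
import os
import pickle
import tempfile


def cached(path, compute):
    """Return the result stored at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    # Write to a temporary file first so an interrupted run never leaves
    # a half-written cache file behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(result, f)
    os.replace(tmp_path, path)
    return result


def expensive_features():
    # Hypothetical stand-in for the slow part of the job.
    return [x * x for x in range(10)]


cache_dir = tempfile.mkdtemp()
features = cached(os.path.join(cache_dir, "features.pkl"), expensive_features)
# The cheap final step can now be edited and rerun without redoing the slow one.
total = sum(features)
```

On the second and later runs with the same path, `expensive_features` is never called; only the last line is recomputed.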
It might be useful to put up a wiki so people can discuss. Even something simple and ugly like c2.
For example, handling hyperparameters is actually a topic in itself.
It almost always makes sense to include Tom Minka's Lightspeed toolbox (http://research.microsoft.com/en-us/um/people/minka/software...) right from the beginning.
Also perhaps Netlab (http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/n...) although it is beginning to get rather dated.
You can only spend so much time optimizing memory use and CPU time with smart data chunking, low-dimensional representations, or approximations. EC2 time and space are relatively cheap, whereas Python on a single machine with the multiprocessing module can only speed things up by a factor of less than the number of cores...
Disclaimer: I have yet to use either, but I've heard good things.
- Record the (Git) revision number of my code for each run.
- Use GNU make to manage the pipeline of downloading, training, evaluating, etc.
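Recording the revision for each run could look something like the sketch below. It assumes the `git` command is on the PATH and falls back to "unknown" otherwise; the function names `git_revision` and `log_run` and the JSON-lines log format are my own choices for illustration, not anything prescribed above.

```python
import json
import subprocess
import time


def git_revision():
    """Return the current commit hash, or "unknown" outside a Git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def log_run(log_path, params, results, note=""):
    """Append one JSON line describing a run to `log_path`."""
    record = {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "revision": git_revision(),
        "params": params,      # hyperparameters used for this run
        "results": results,    # whatever metrics the run produced
        "note": note,          # the hunch that motivated the run
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Calling something like `log_run("runs.jsonl", {"lr": 0.1}, {"accuracy": 0.9}, note="smaller learning rate")` at the end of each run gives a log that can be grepped later to match results to code versions.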
From my experience, here are some advantages of this architecture:
- stages can be rewritten independently, so you can prototype in a language that is quick to write (Perl in my case) and later rewrite whole stages, or parts of them, in a language that is fast to execute (C or C++ here) if you need extra performance;
- you can easily integrate third-party software into your workflow, since most of the existing tools in the field work with input and output files;
- you could reuse already written stages for different purposes - just pass them different options for input/output and parameters.
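To make the file-in, file-out idea concrete, here is a sketch of what a single stage might look like in Python. The filtering task, the option names (`--input`, `--output`, `--min-length`), and the file names are invented for the example; the point is only that each stage reads a file, writes a file, and takes its parameters as options.

```python
import argparse
import json
import os
import tempfile


def run_stage(input_path, output_path, min_length):
    """Read one text record per line, keep the long ones, write JSON lines."""
    kept = 0
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            text = line.strip()
            if len(text) >= min_length:
                dst.write(json.dumps({"text": text, "length": len(text)}) + "\n")
                kept += 1
    return kept


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical filter stage.")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--min-length", type=int, default=1)
    args = parser.parse_args(argv)
    return run_stage(args.input, args.output, args.min_length)


# Demo: build a tiny input file and run the stage as the command line would.
workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "raw.txt")
out = os.path.join(workdir, "filtered.jsonl")
with open(raw, "w") as f:
    f.write("a\nhello world\nok\n")
kept = main(["--input", raw, "--output", out, "--min-length", "3"])
```

Because the stage only knows about its input and output paths, it can be swapped for a faster implementation later, or pointed at different files, without touching the stages around it.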
It is, however, very good advice when taken in the correct context.