
Ask HN: Shell-Based Research Workflow? - 0db532a0
Yesterday I was trying to figure out how to scrape odds from a betting website. This sort of thing involves a lot of trial and error and a lot of intermediate files which are easy to lose track of. Sometimes you overwrite a result produced with different command parameters which you later need again, and you lose track of the command and parameters you used to produce a given file.

I am wondering how to organise this sort of job better.

Here is what my job involved:

1. Use Python and Selenium to give me the source of a page after it finishes loading.
2. Put the source through xmllint, look at it in an editor, and find a script tag where the odds lie in some very obfuscated JS.
3. Extract the text of the script tag with xmllint.
4. Put the output of that through a JS parser, and extract the desired odds after trial and error to find the right combination of parameters.
5. Use another tool to format the odds as a TSV.
6. Etc., etc. (A rough sketch of the whole pipeline is at the end of this post.)

As an example, Mathematica has notebooks which allow you to organise your job and go back to previous results. Normally everything is ordered by time, but you can also rearrange things. Mathematica notebooks actually allow you to construct a tree out of your results.

You could also view my job above as a tree. Each invocation of a command takes some input files and produces an output file, which may in turn be used by another command. Maybe I make multiple invocations of a command before realising that I want to go back to a previous result and use that as input for further research.

How do you organise your shell-based research jobs? Does it involve some sort of file-naming convention? Is there a GUI for this? Do you have some sort of shell-based tool which shows a nice tree relating your commands, their parameters and their result files? Maybe there is an Emacs package written for this? Do you put things in separate directories?

I'm interested to learn about your methods.
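
For concreteness, the job looked roughly like this as a plain shell script. The script names (fetch_page.py, extract_odds.js), the XPath index and the jq filter are stand-ins rather than the real ones:

    # 1. Selenium: dump the page source once the JS has finished loading
    python fetch_page.py "$URL" > page.html
    # 2. pretty-print it so the script tag is findable in an editor
    xmllint --html --format page.html > pretty.html
    # 3. pull the text of that script tag back out
    xmllint --html --xpath 'string(//script[42])' page.html > odds.js
    # 4. run it through a JS parser and dig the odds out
    node extract_odds.js odds.js > odds.json
    # 5. flatten to TSV
    jq -r '.[] | [.name, .odds] | @tsv' odds.json > odds.tsv

Each of those intermediate files got regenerated several times as I fiddled with parameters, which is exactly where I kept losing track of things.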
======
enkiv2
I usually put the process ID in the names of my temporary files, in order to
avoid overwriting stuff. (If you're liable to do multiple runs with the same
process, you can generate a random number at the beginning of the task -- or
get the date, if you really want to keep the results in order -- and use it as
a unique token.)
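
A minimal sketch of that convention (the file names here are just
illustrative):

    # plain PID version
    sort -u hits.log > "hits.$$.sorted"

    # or one token per run: the date keeps results in order,
    # RANDOM keeps them distinct
    run="$(date +%Y%m%d-%H%M%S).$RANDOM"
    sort -u hits.log > "hits.$run.sorted"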

I first prototype operations as single command lines, then refactor those
command lines as I migrate components of them to shell functions (often taking
whatever unique token I want to use to distinguish runs). Eventually, when I'm
satisfied with the behavior, I'll write the shell functions to a script with
typeset -f, and then source the script later.
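
Sketched out (not the literal history; extract_script and the XPath are
made up):

    # prototype at the prompt, then fold it into a function that takes
    # the run token as its first argument
    extract_script() {
        local run=$1 page=$2
        xmllint --html --xpath 'string(//script[1])' "$page" > "script.$run.js"
    }

    # once the behavior is right, save the definition and source it later
    typeset -f extract_script > scrape_funcs.sh
    # later:  . scrape_funcs.sh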

While it's not research-related (a lot of what I do at work is statistical
analysis on server logs, but anything I develop at work I technically don't
own & so can't post), a good example of this style is in my music synthesizer
project:
https://github.com/enkiv2/clisynth/blob/master/clisynth.sh ; all of the code
in this file was prototyped at the command line.

~~~
0db532a0
I never considered defining functions as I go. That is a good idea. Previously
I've just relied on readline. typeset is a new find.

It might be good to have a helper which, when called from another function,
uses mktemp to generate a file name that incorporates the calling function's
name. It could even append the calling function's name plus its arguments to
a log file.
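
Something like this, perhaps (bash-specific since it leans on FUNCNAME;
tmp_for_caller and extract_odds are made-up names):

    # make a temp file named after the calling function, and log the call
    tmp_for_caller() {
        local caller=${FUNCNAME[1]:-toplevel} f
        f=$(mktemp "./${caller}.XXXXXX") || return
        printf '%s\t%s\t%s\n' "$f" "$caller" "$*" >> run.log
        printf '%s\n' "$f"
    }

    extract_odds() {
        local out
        out=$(tmp_for_caller "$@")
        xmllint --html --xpath 'string(//script[1])' "$1" > "$out"
    }

Then each intermediate file's name says which step produced it, and run.log
keeps the parameters it was produced with.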

