
How to separate your data from your code - pplonski86
https://negfeedback.blogspot.com/2019/04/how-to-separate-your-data-from-your-code.html
======
dusted
Why in the world would you do any of these things as opposed to making your
program ask the user for the filename? You've limited your program to
working on exactly one datafile. If you just ask the user (any way is good,
but program arguments are preferred by a long shot), then you allow them to
understand the files involved, and you get to run multiple versions of the
program against multiple files, compare their output, process your data
in parallel, or process different sets of data.

Sorry, not buying.

Also, the whole post could have been written as "Don't hardcode filepaths, use
a configuration file." but again, that's not what configuration files are
for.. If you want to run a program on a specific set of files, you're better
off wrapping the invocation in a script.

~~~
skohan
The fact that this article is about jupyter notebooks, and apparently doesn't
know that gitignore exists tells me that this is probably written by a data
scientist who is in the process of "discovering" solutions to long-solved
problems in software development.

~~~
wodenokoto
For a notebook I would ask the user to change a filepath variable in one of
the top cells.
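
For example (hypothetical filename):

```python
# Top cell of the notebook: the only place a filepath appears.
# "data/measurements.csv" is a made-up name; each user edits this
# one line to point at their local copy of the data.
DATA_PATH = "data/measurements.csv"

# Later cells refer only to the variable, e.g.:
#     import pandas as pd
#     df = pd.read_csv(DATA_PATH)
```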

But I'm also a stupid data scientist.

What do you propose as the long-solved solution?

~~~
pxndx
I'd propose reading data from stdin. Seeing this is about python, one could
even consider using fileinput [0] to manage multiple files.

[0]:
[https://docs.python.org/3/library/fileinput.html](https://docs.python.org/3/library/fileinput.html)

~~~
wodenokoto
It is a jupyter notebook, not a python script. Your proposal won't work.

You generally don't open a notebook from the commandline, you open it from the
notebook file picker inside your browser.

You generally don't "run" a notebook, like you would a script. It is an
interactive programming environment.

[https://github.com/jupyter/nbconvert/issues/681](https://github.com/jupyter/nbconvert/issues/681)

------
yellowapple
Even better would be to just use standard input/output. Like so (assuming
pandas accepts ordinary file handles for reading and writing CSVs; I have very
limited experience with it, and I tend to avoid Python in general):

    
    
        #!/usr/bin/env python
        # process-csv.py
        import sys
        import pandas
        
        def expensive_operation(data):
            # TODO: replace with the real processing
            return data
        
        input_data = pandas.read_csv(sys.stdin)
        results_data = expensive_operation(input_data)
        results_data.to_csv(sys.stdout)
    

Or maybe even shorter (I think this is valid Python?):

    
    
        #!/usr/bin/env python
        import sys
        import pandas
        
        def expensive_operation(data):
            # TODO: replace with the real processing
            return data
        
        expensive_operation(pandas.read_csv(sys.stdin)).to_csv(sys.stdout)
    

And then use it like so:

    
    
        $ python process-csv.py <input.csv >results.csv
    

Or like so:

    
    
        $ csv-generating-command | python process-csv.py | csv-consuming-command
    

You could even do a quick

    
    
        $ chmod +x process-csv.py
    

And then call it directly

    
    
        $ ./process-csv.py <in.csv >out.csv
    

Or copy it to somewhere in your $PATH

    
    
        $ cp process-csv.py ~/bin/process-csv
    

At which point you can run from anywhere

    
    
        $ cd /literally/anywhere/else/
        $ sudo pip install csvcat  # or something
        $ csvcat *.csv | process-csv >~/Documents/results-201904021616.csv
    

Of course, we could get even crazier with a custom Pip package or whatever the
actual terminology is (egg?) and do all the setup.py and requirements.txt and
whatnot that Python packaging entails and yadda yadda yadda, but that's
probably overkill for a one-off script. Point is: your fancy Macbook has a
full-blown actual UNIX™ on it, so might as well put it to good use :)

~~~
chthonicdaemon
(author of post)

I am a huge fan of this approach and use it extensively in my own code, but
this natural command-line interaction is completely alien to my students.
They are barely holding on trying to learn Python in the notebook
environment; telling them that the solution to their problem is to learn
another language has not worked well for me.

~~~
yellowapple
Gotcha. Given that Jupyter feels pretty alien to me coming from a command-
line-heavy background, I can understand the stubbornness :)

In that case, I'd probably more readily recommend just adding a *.csv line to
.gitignore and letting students add/remove files accordingly (unless the
lesson is specifically about how configuration files work, of course, in which
case by all means your approach is reasonable).

~~~
chthonicdaemon
Adding the .csv files to .gitignore only avoids the one issue of checking
data files in. How do collaborators ensure they're working on the same data
files?

I am aiming to minimise the manual steps involved in running the latest
version of the code on the latest version of the input data. With my system
you just pull the latest code and run it.

~~~
yellowapple
You can have that with gitignored local data files, too, though: since git is
ignoring the data, a simple "git pull" will indeed ignore what's already in
there. Meanwhile, the answer to "How do collaborators ensure they're working
on the same data files?" would be "Drag and drop those files into the source
repo from whichever means they wanna use to share them.".

Yet another way to skin this particular cat might be to write a script as a
drag-and-drop target. I don't think Jupyter supports this, so your students
would have to venture beyond it a bit, but your Windows-using students for
example can drag files onto actual .py files and those dropped files will show
up in sys.argv (I think macOS requires the extra step of using py2app, but
that might be a good opportunity for a py2app/py2exe lesson). I'd post sample
code like in my other comments, but I'm on my phone at the moment.

------
kbirkeland
Isn't this what git lfs[0] sets out to solve?

[0] [https://git-lfs.github.com/](https://git-lfs.github.com/)
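
In case it helps anyone, the per-repo setup is small (standard git-lfs
commands; requires git-lfs to be installed):

```shell
git lfs install            # once per machine: set up the LFS hooks
git lfs track "*.csv"      # writes a tracking rule into .gitattributes
git add .gitattributes
git commit -m "Track CSV files with Git LFS"
```

After that, `git add` and `git push` store the CSVs as small LFS pointer
files instead of full blobs in the repo history.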

------
kevsim
Jupyter notebooks are really abysmal when it comes to engineering best
practices. There's an amazing talk called "I don't like notebooks" by Joel
Grus at last year's JupyterCon [0]. I highly recommend it.

[0]
[https://www.youtube.com/watch?v=7jiPeIFXb6U](https://www.youtube.com/watch?v=7jiPeIFXb6U)

------
segmondy
There's a difference between data and code?

~~~
yellowapple
Not everyone uses Lisp ;)

------
vkaku
The other important question is : when to.

------
lincpa
Using the input/output behaviour of pure functions, treat each pure
function as a pipeline stage: a dataflow is formed by putting a series of
pure functions in series. A dataflow code block, taken as a function, is
equivalent to an integrated-circuit element (or board). A complete
integrated system is then formed by composing dataflows in series or in
parallel.

Put another way: data and logic are strictly separated, down to the level
of individual elements, and processing becomes data-stream processing.

The sea is sailed by the helmsman, and the program moves with the data.
Initial state, final state, the shortest straight line between two points.
Simplicity is the root of fast, stable and reliable software.

[https://github.com/linpengcheng/PurefunctionPipelineDataflow](https://github.com/linpengcheng/PurefunctionPipelineDataflow)
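
A minimal Python illustration of the idea (hypothetical stage names; each
stage is a pure function, and composition forms the dataflow):

```python
from functools import reduce

# Each stage is a pure function: its output depends only on its input.
def clean(rows):
    return [r.strip() for r in rows if r.strip()]

def parse(rows):
    return [int(r) for r in rows]

def total(nums):
    return sum(nums)

def pipeline(*stages):
    # Compose stages left to right into a single dataflow function.
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

process = pipeline(clean, parse, total)
# process([" 1 ", "2", "", "3"]) runs clean, then parse, then total
```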

~~~
millstone
The article is about avoiding bloat in SCM repos.

~~~
lincpa
I can't connect to this site (blogspot.com), ERR_CONNECTION_TIMED_OUT

So, I can only comment by the title.

~~~
wodenokoto
In that case, please refrain from commenting

