Data Science at the Command Line

jeroenjanssens · on Jan 27, 2018

Believe it or not, it's partly thanks to you, HN, that I wrote this book in 2014 in the first place [1]! It's humbling to see it listed here again now that the text has become available under a CC BY-ND license. Thank you for your help in spreading the word.

All this attention (read: likes, shares, and page views) is making me wonder whether it's worthwhile to write an update (or even a second edition). What do you think? What would you like to see changed or added?

[1] https://news.ycombinator.com/item?id=6412190

utefan001 · on Jan 27, 2018

sort | uniq -c | sort -n example in chapter 5 is something I use a lot. Do you typically use Mac or Linux? The way Mac does threading made sort 16 times slower on 1GB+ size files compared to my $300 old Lenovo laptop running Ubuntu.

vondur · on Jan 27, 2018

Have you tried installing it from homebrew? brew install coreutils This assumes you have brew installed.

utefan001 · on Jan 27, 2018

Yea, same issue with gnu coreutils. Mac uses a fairness threading algorithm, causing lots of time to be wasted deciding whose turn it is.

https://stackoverflow.com/questions/28888719/multi-threaded-...

minimaxir · on Jan 27, 2018

While the efficiency of a command line is always sexy, but for data science in particular, where reproducibility is important and bugs are subtle and often don't cause a terminal error, it is worth it to sacrifice a little bit of code efficiency for code clarity in the long run by using an IDE/Notebook.

achompas · on Jan 27, 2018

Notebooks are not much better than copy-pasting from a notepad or editor into an interpreter. They’re great for reports, but dangerous for presenting the illusion of reproducibility.

At best you’re constantly restarting your kernel and clearing output. More likely, output from cell #7 has modified output [138] but you haven’t updated the chart produced in cell #17 (or some similar craziness). Not much better than programming with GOTOs.

“But they’re great for reporting and visualization!” you might say. If you’re building any report of value, though, it will influence important decisions. That’s the reason your code shouldn't live in a notebook. It should be in a library, covered by unit tests, so that those decisions aren’t based on faulty logic.

minimaxir · on Jan 27, 2018

Notebooks (and this command-line ebook) assume that the input data is static (i.e. an ad-hoc analysis) which is a more typical use case.

Dynamic data/reporting is a different thing entirely, at which point things like business intelligence software and dashboards come into play, and outside the scope of a command line anyways.

achompas · on Jan 29, 2018

I might have needed to edit down my comment, but you locked on to the least important part of my argument.

If reproducibility is important, like you say, then a notebook is the last thing you want. Your code needs to be tested and designed like the software it is, instead of tossed into some notebook that does not fit into classic software practices.

IanCal · on Jan 28, 2018

Notebooks however let you run code blocks in arbitrary orders, delete ones but keep their results in memory, change them without rerunning and change code with none of the downstream dependencies updating.

It's possible (actually very easy) to have code which works as you're making it but not if you run it from scratch.

minimaxir · on Jan 28, 2018

Which is why notebooks typically a) declare all imports/dependencies in the first block and b) the entire notebook is rerun from a fresh session before publishing (and notebooks have a keyboard command for doing that too).

In all my years of work with Notebooks, I've never had an issue with downstream dependencies.

achompas · on Jan 29, 2018

Important distinction: the notebooks themselves don't do this. The notebook authors, by convention, are supposed to do this.

This is no different than engineers who are, by convention, supposed to write bug-free code. Even with this convention, however, devs still rely on testing to decrease the likelihood of bugs.

No similar tooling exists for notebooks, which is why I recommend moving re-used logic into a tested codebase.

confounded · on Jan 28, 2018

> ... using an IDE/Notebook.

I was with you until the end. The important part is that a script is clear and commented, and can be written to fail informatively. These are benefits of using a scripting language.

Who cares if an IDE was used rather than a traditional editor?

Dynamic notebooks and the JSON mess they generate are a personal peeve of mine, and IME, the enemy of reproducibility.

If it’s for a one-off analysis, but processing is complicated enough that scripts don’t make dependencies between processing steps clear, use GNU Make.

If it’s a data product that’s running in the background, consider something like Airflow.

Still, I find coreutils and friends incredibly useful for interactively sizing up text data.

peatmoss · on Jan 27, 2018

I very frequently do data tasks as bash command lines and do so within an org-mode code block. So, at least with org-mode, notebook computing and data processing in the shell are not mutually exclusive.

EDIT: Also I should note that notebooks are not the only (or in my opinion best) way to present a reproducible analysis.

arca_vorago · on Jan 28, 2018

I'm working on my data science degree (previously a senior sysadmin), and this is also the workflow I have settled on. Along with versioning/diffs so you can walk back in time to see your changes in a script, I think it's one of the more robust and reproduceable systems around.

So my scripts focus on the data, but around the code block are comments and notes about what is being done, etc.

I've recently been looking at how to integrate makefile into this since watching the following video: https://www.youtube.com/watch?v=sd0HhW8vkSQ

All that said, kudos to the author for this, reading through it I see a few things I hadn't even though of yet, and I live on the command line in general. If it werent for browsing the internet (eww in emacs is nice though) and gaming, I don't think I'd even need a desktop environment.

sedachv · on Jan 29, 2018

> If it werent for browsing the internet (eww in emacs is nice though) and gaming, I don't think I'd even need a desktop environment.

I switched to EXWM[1]; Firefox looks and acts like an Emacs buffer.

[1] https://github.com/ch11ng/exwm

pjmorris · on Jan 28, 2018

Code clarity includes the interfaces of the tool of choice. I suspect that shell files checked in to a version control system will have a longer working life because IDE/Notebook interfaces are more likely to change and become incompatible over time driven by the commercial needs of their vendors.

swhalen · on Jan 27, 2018

It seems to me that encapsulating analyses in bash scripts would help with reproducibility.

tejtm · on Jan 27, 2018

scripts marshalled with makefiles have helped me with long period recurrent tasks that also slowly evolve

rabidrat · on Jan 27, 2018

Is there a tool that would create a notebook from the command-line?

eivarv · on Jan 27, 2018

I think that depends on what you mean; Jupyter lets you run magic[0] commands (run bash commands in subprocesses), and there's also a bash kernel[1].

[0]: https://blog.dominodatalab.com/lesser-known-ways-of-using-no...

[1]: https://github.com/takluyver/bash_kernel

marmaduke · on Jan 27, 2018

Jupyter notebook has Bash kernels

jeroenjanssens · on Jan 27, 2018

http://jeroenjanssens.com/2015/02/19/ibash-notebook.html

adeelk93 · on Jan 27, 2018

You can also do the same in RMarkdown, if you're more of an R user

make3 · on Jan 27, 2018

notebooks are garbage because of the arbitrary order of execution

goatlover · on Jan 28, 2018

The cells have numbers corresponding to execution, and you can restart and rerun the entire notebook, or down to a certain cell very easily.

make3 · on Jan 28, 2018

jupyter specifically discourage this by hiding the button to run the whole notebook in a menu and by not assigning a shortcut to the functionality. It's better in colab though, where at least there is a button.

The thing is that people who are not professional programmers often don't realize that this is a danger, and their thing doesn't work and they don't know that it's because they're in a weird state emanating from the random execution of code blocks. So, I mean, notebooks are useful and cool, but they're definitely dangerous, especially for people who aren't software engineers, which is likely a huge fraction of their users

fnl · on Jan 27, 2018

Given the number of tools around, and as this book promotes Drake, has anyone got "comparative experience" with some of the following tools:

Cookiecutter: https://drivendata.github.io/cookiecutter-data-science

DataVersionControl: https://dataversioncontrol.com/

Drake: https://github.com/Factual/drake

Luigi: https://github.com/spotify/luigi

Pachyderm: http://www.pachyderm.io/

Sacred: https://github.com/IDSIA/sacred

They all focus in slightly different ways on the issue of managing data science/machine learning workflows, so I wonder if someone has a clear preference for one of those over any another.

EDIT: added Luigi

fnl · on Jan 27, 2018

To amend, why I'm even bringing this up, what worries me about Drake is this: https://github.com/Factual/drake/pulse/monthly Its GitHub pulse is - dead; For two years now. Makes me think one of the other projects listed might be better choices.

bringtheaction · on Jan 27, 2018

I was about to say maybe it's finished but it has 70 open issues so maybe it was just abandoned?

It is at version 1.0.3 though so it could be that it's considered finished. Seems strange to leave the issues open if it was though.

fnl · on Jan 28, 2018

Maybe so. Yet, I can't find any features in Drake that I don't get with Make, too - in fact, it looks to be rather the opposite.

Indeed, for some of the tools I listed, they barely have any more functionality than I'd get out of Make & Git alone. And for Make, I'm pretty sure development & support will stick around for a few more years...

To me, only Luigi (Hadoop integration), Pachyderm (containers, production deployments in the Enterprise version) and Sacred (Python & TensorFlow integration) really stick out as differentiating themselves. But maybe I'm overlooking something?

rdudekul · on Jan 27, 2018

As a developer I have not paid much attention to the power of command line tools, to get things done. This book is a great resource that uses docker to create simple command line utilities to implement useful Data Science functionality.

jhoechtl · on Jan 27, 2018

An indispensable tool for the command line is miller

https://github.com/johnkerl/miller

bobivl · on Jan 27, 2018

I also used bash scripts a lot to get quick insights from csv files. Someday I realized that these are mostly sql queries that I encoded into complex scripts. For the sake of trying, I implemented a simple sql to bash transpiler that takes a sql query and returns a bash one-liner that you can execute on csv file(s).

Give it a try: http://bigbash.it

jimnotgym · on Jan 27, 2018

Is there a good way to get this onto an e-reader, am I missing something obvious?

lonriesberg · on Jan 30, 2018

You can buy a Kindle version from Amazon: https://www.amazon.com/Data-Science-Command-Line-Time-Tested...