
Data Science at the Command Line - kawera
https://www.datascienceatthecommandline.com/
======
jeroenjanssens
Believe it or not, it's partly thanks to you, HN, that I wrote this book in
2014 in the first place [1]! It's humbling to see it listed here again now
that the text has become available under a CC BY-ND license. Thank you for
your help in spreading the word.

All this attention (read: likes, shares, and page views) is making me wonder
whether it's worthwhile to write an update (or even a second edition). What do
you think? What would you like to see changed or added?

[1]
[https://news.ycombinator.com/item?id=6412190](https://news.ycombinator.com/item?id=6412190)

~~~
utefan001
The sort | uniq -c | sort -n example in chapter 5 is something I use a lot. Do
you typically use Mac or Linux? The way Mac does threading made sort 16 times
slower on files of 1GB or more compared to my old $300 Lenovo laptop running
Ubuntu.
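
For readers who have not met the idiom, here is a minimal sketch of the sort | uniq -c | sort -n pipeline. The function name count_values and the inline sample data are assumptions made for illustration; they are not taken from the book.

```shell
# Count how often each distinct line occurs, least frequent first.
# count_values is a hypothetical helper name, not from the book.
count_values() {
  sort |        # bring identical lines next to each other
    uniq -c |   # collapse each run into "<count> <line>"
    sort -n     # order numerically by that count
}

# Demo on inline data so the sketch runs without any input file.
printf 'linux\nmac\nlinux\nlinux\nmac\nbsd\n' | count_values
```

The first sort is needed because uniq only merges adjacent duplicates. The width of the count column differs between GNU and BSD uniq, so scripts that parse the result are safer splitting on whitespace than on fixed columns.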

~~~
vondur
Have you tried installing it from Homebrew with brew install coreutils? This
assumes you have brew installed.
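
One detail worth knowing: by default Homebrew installs the GNU coreutils commands with a g prefix, so GNU sort is called gsort. A small sketch, assuming that default naming, that falls back to the system sort when gsort is absent (pick_sort is a hypothetical helper name):

```shell
# Homebrew installs GNU coreutils with a "g" prefix by default, so GNU sort
# is available as gsort. pick_sort is a hypothetical helper name.
pick_sort() {
  if command -v gsort >/dev/null 2>&1; then
    echo gsort
  else
    echo sort
  fi
}

# Use whichever implementation was found.
SORT=$(pick_sort)
printf 'b\na\nc\n' | "$SORT"
```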

~~~
utefan001
Yeah, same issue with GNU coreutils. Mac uses a fairness threading algorithm,
causing lots of time to be wasted deciding whose turn it is.

[https://stackoverflow.com/questions/28888719/multi-threaded-...](https://stackoverflow.com/questions/28888719/multi-threaded-c-program-much-slower-in-os-x-than-linux)

------
minimaxir
While the efficiency of the command line is always sexy, for data science
_in particular_, where reproducibility is important and bugs are subtle and
often don't cause a terminal error, it is worth sacrificing a little bit of
code efficiency for code _clarity_ in the long run by using an IDE/Notebook.

~~~
rabidrat
Is there a tool that would create a notebook from the command-line?

~~~
marmaduke
Jupyter notebook has Bash kernels

~~~
jeroenjanssens
[http://jeroenjanssens.com/2015/02/19/ibash-notebook.html](http://jeroenjanssens.com/2015/02/19/ibash-notebook.html)

------
fnl
Given the number of tools around, and as this book promotes Drake, has anyone
got "comparative experience" with some of the following tools:

Cookiecutter: [https://drivendata.github.io/cookiecutter-data-science](https://drivendata.github.io/cookiecutter-data-science)

DataVersionControl:
[https://dataversioncontrol.com/](https://dataversioncontrol.com/)

Drake: [https://github.com/Factual/drake](https://github.com/Factual/drake)

Luigi: [https://github.com/spotify/luigi](https://github.com/spotify/luigi)

Pachyderm: [http://www.pachyderm.io/](http://www.pachyderm.io/)

Sacred: [https://github.com/IDSIA/sacred](https://github.com/IDSIA/sacred)

They all focus in slightly different ways on the issue of managing data
science/machine learning workflows, so I wonder if someone has a clear
preference for one of those over any other.

EDIT: added Luigi

~~~
fnl
To explain why I'm even bringing this up, what worries me about Drake is this:
[https://github.com/Factual/drake/pulse/monthly](https://github.com/Factual/drake/pulse/monthly)
Its GitHub pulse has been dead for two years now, which makes me think one of
the other projects listed might be a better choice.

~~~
bringtheaction
I was about to say maybe it's finished but it has 70 open issues so maybe it
was just abandoned?

It is at version 1.0.3 though so it could be that it's considered finished.
Seems strange to leave the issues open if it was though.

~~~
fnl
Maybe so. Yet I can't find any features in Drake that I don't also get with
Make; in fact, it looks to be rather the opposite.

Indeed, some of the tools I listed barely have any more functionality than I'd
get out of Make & Git alone. And for Make, I'm pretty sure development &
support will stick around for a few more years...

To me, only Luigi (Hadoop integration), Pachyderm (containers, production
deployments in the Enterprise version) and Sacred (Python & TensorFlow
integration) really stick out as differentiating themselves. But maybe I'm
overlooking something?
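
To make the comparison with Make concrete: its central rule is to rebuild a target only when the target is missing or older than its input. A rough shell sketch of that rule follows; build_if_stale and the file names are hypothetical, and this is not how Make or Drake is implemented.

```shell
# build_if_stale is a hypothetical helper: run a command to produce OUTPUT
# only when OUTPUT is missing or older than INPUT, which is the timestamp
# rule Make applies to a target and its prerequisite.
build_if_stale() {
  input=$1
  output=$2
  shift 2
  if [ ! -e "$output" ] || [ "$input" -nt "$output" ]; then
    "$@" < "$input" > "$output"   # remaining arguments are the command
    echo "rebuilt $output"
  else
    echo "up to date: $output"
  fi
}

# Demo in a scratch directory: the second call finds nothing to do.
workdir=$(mktemp -d)
printf 'b\na\nb\n' > "$workdir/raw.txt"
build_if_stale "$workdir/raw.txt" "$workdir/sorted.txt" sort
build_if_stale "$workdir/raw.txt" "$workdir/sorted.txt" sort
rm -rf "$workdir"
```

The -nt file test is widely supported (bash, dash, zsh) but is not part of POSIX, so a strictly portable version would need a different comparison.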

------
rdudekul
As a developer, I have not paid much attention to the power of command-line
tools for getting things done. This book is a great resource that uses Docker
to create simple command-line utilities that implement useful data science
functionality.

------
jhoechtl
An indispensable tool for the command line is Miller:

[https://github.com/johnkerl/miller](https://github.com/johnkerl/miller)

------
bobivl
I also used bash scripts a lot to get quick insights from CSV files. Someday I
realized that these were mostly SQL queries that I had encoded into complex
scripts. For the sake of trying, I implemented a simple SQL-to-bash transpiler
that takes a SQL query and returns a bash one-liner that you can execute on
CSV file(s).

Give it a try: [http://bigbash.it](http://bigbash.it)
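
To illustrate the kind of correspondence described, here is a hand-written sketch (not output from bigbash) of a GROUP BY query expressed as a shell pipeline. The function name count_by_column, the sample data, and the column layout are all assumptions, and it only handles simple CSV without quoted commas.

```shell
# Hand-written equivalent of:
#   SELECT city, COUNT(*) FROM people GROUP BY city ORDER BY COUNT(*) DESC;
# Assumes a simple CSV with a header row and no quoted commas.
# count_by_column and the sample columns are hypothetical.
count_by_column() {
  file=$1
  column=$2
  tail -n +2 "$file" |      # skip the header row
    cut -d, -f"$column" |   # SELECT the grouping column
    sort |                  # bring equal values together
    uniq -c |               # GROUP BY ... COUNT(*)
    sort -rn                # ORDER BY COUNT(*) DESC
}

# Demo on a temporary file with made-up rows.
csv=$(mktemp)
printf 'name,city\nann,Oslo\nbob,Rome\ncat,Oslo\n' > "$csv"
count_by_column "$csv" 2
rm -f "$csv"
```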

------
jimnotgym
Is there a good way to get this onto an e-reader, or am I missing something
obvious?

~~~
lonriesberg
You can buy a Kindle version from Amazon:
[https://www.amazon.com/Data-Science-Command-Line-Time-Tested...](https://www.amazon.com/Data-Science-Command-Line-Time-Tested/dp/1491947853)

