Data Science at the Command Line, 2nd Edition (2021) (jeroenjanssens.com)
173 points by aragonite 23 days ago | 30 comments



Having not read the book, I don't know if it delves into the speedups that come from the fact that the processes in a pipeline run in parallel. A nice article about how much faster it can be to process data on the command line vs something like Hadoop: https://adamdrake.com/command-line-tools-can-be-235x-faster-...

Clearly it doesn't work in all cases, but often enough it's not only simpler to set up, it runs much faster besides.
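
Roughly the kind of pipeline the article is talking about (log paths and fields here are made up): each stage is its own process, so decompression, filtering, and aggregation overlap on separate cores.

  zcat access_logs/*.gz \
    | awk -F'\t' '$3 == 200 { print $7 }' \
    | sort | uniq -c | sort -rn | head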


Always worth giving a shoutout to the "Scalability! But at what COST?" paper (pdf https://www.frankmcsherry.org/assets/COST.pdf)

  We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.


After numerous trials with CUDA-enabled machines, I found that "cost function" models are better at determining efficiency.

For example:

* A single-GPU machine is often more efficient than multiple cards due to bus I/O bandwidth constraints, lack of software support, and obscure driver failure modes

* A single-CPU machine is often more efficient than multi-chip solutions due to memory access latency and caching issues

* A machine doing sequential drive access is often more efficient than drive arrays due to pipelined memory/cache layout and baked-in access latency

* A process bound to a single CPU core is often more efficient because it avoids threading and/or mailbox overheads

Thus, if a problem is truly separable, then it is sometimes wiser to bind n-many slow jobs to n-many cores rather than parallelize 1 job to try to get it done more quickly.
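
A rough sketch of what I mean, with gzip standing in for whatever the real per-file job is:

  # one independent job per core, rather than one parallelized job
  find . -name '*.csv' -print0 | xargs -0 -n 1 -P "$(nproc)" gzip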

I found this rather surprising in processing large video media data sets.

Some may disagree, but they are mostly 15% to 30% more wrong for truly separable tasks. YMMV =3


What immediately makes it worth it is when the cloud provider you are (supposedly) already invested in provides the CPU cluster without extra setup cost.

Examples: Google BigQuery, AWS Athena, or AWS serverless EMR.

These cloud providers always have a bunch of CPUs idle, so they even provide the CPU time for free; it is just the loading of data that costs.


I love BigQuery, but they do charge you for CPU slots/processing. It's a great deal to not have to manage all that stuff though.


Oh, that's new then? Or maybe it was a part of the monthly bill I never saw.

AWS (still) charges per amount of data loaded, though.


GNU parallel FTW, it's the single easiest way to speed up command-line batches.
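
For example (the per-file command is whatever your batch step happens to be):

  # one gzip per job slot, across all cores
  parallel gzip ::: data/*.csv

  # the {} and {.} placeholders make per-file output naming easy
  parallel 'wc -l {} > {.}.count' ::: data/*.csv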


Worth highlighting that make defines a DAG, and that you can run make tasks in parallel, which will automatically bottleneck (as desired) on common dependencies and fan out otherwise.

The sort of massive C++ builds that make can handle are typically much more complicated than your average ETL pipeline. So, there is plenty of room to grow into make.
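
A minimal sketch of what that looks like for data work, with made-up URLs and a placeholder join script; running make -j4 report.csv fetches the two inputs in parallel and then runs the join once both exist (recipe lines must start with a tab):

  a.csv:
  	curl -sS https://example.com/a.csv -o $@

  b.csv:
  	curl -sS https://example.com/b.csv -o $@

  report.csv: a.csv b.csv
  	./join_and_aggregate.sh a.csv b.csv > $@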


Yes, that's extremely important. I've had great success replacing Airflow, Luigi, and friends with a cron-ed Makefile target refreshing some database tables (usually materialized views).

I've then used this tool to visualize the execution graph. https://github.com/lindenb/makefile2graph

The result looks like this: https://tselai.com/data/graph.png

The convenient thing is that each node in the execution graph is in a different environment: some are shell scripts, some are Python scripts, while others are SQL queries.
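
A stripped-down version of that setup; the view names, DATABASE_URL, and paths are made up, and the *.refresh targets are phony so every cron run refreshes the views in dependency order:

  # crontab: 0 * * * * make -C /srv/pipeline refresh
  .PHONY: refresh daily_orders.refresh sales_summary.refresh

  refresh: sales_summary.refresh

  # the summary view reads from daily_orders, so refresh that one first
  sales_summary.refresh: daily_orders.refresh
  	psql "$$DATABASE_URL" -c 'REFRESH MATERIALIZED VIEW sales_summary;'

  daily_orders.refresh:
  	psql "$$DATABASE_URL" -c 'REFRESH MATERIALIZED VIEW daily_orders;'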


The classics never go out of style. Mike Bostock has a nice article about the use of Make for data workflows: https://bost.ocks.org/mike/make/.


There are also a few ways to spread make-managed processes over multiple nodes, assuming they share storage, which could be useful if your transforms bottleneck on the CPU rather than on storage or network IO.


This is a great book; there are a few tools I would add.

Datasette, clickhouse-local (CLI), and DuckDB.

I think ripgrep is a big omission; ripgrep | xargs jq and find -exec jq are among my most common data science workflows because you can get stuff done in a few minutes. An example of where I use this is to quickly debug Infrastructure as Code that is generated for many regions and AZs.

Another one I like in this space is Bioinformatics Data Skills; for some reason bioinformaticians use CLI workflows a lot, and this book covers a lot of good info for those who are just starting out, like tmux, make, git, ssh, and background processes.

Two other techniques I like are git-scraping (tracking the changes of data over time or just saving snapshots of your data to git so you can diff it): https://simonwillison.net/2020/Oct/9/git-scraping/ I most recently have used this technique to diff changes to build artifacts over time.
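
A minimal version of the git-scraping loop looks something like this (URL, path, and schedule are made up); run it from cron or a scheduled CI job inside a git repo:

  curl -sS https://example.com/latest.json -o data/latest.json
  git add data/latest.json
  # commit only if the data actually changed
  git diff --cached --quiet || git commit -m "Snapshot $(date -u +%Y-%m-%dT%H:%M)"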

This next one is not really CLI related per se, but I really like the HTTP range query technique of hosting data: https://news.ycombinator.com/item?id=27016630 There are simple ways to use this idea (like cooking up a quick h2o web server conf) to host data quickly.

I also like the makefile data pipeline idea, I believe the technique is described in the book but I first heard of it from this HN comment: https://news.ycombinator.com/item?id=18896204 The basic idea is that you can use make to orchestrate the steps of your command line data science workflow and let make figure out when your intermediate data needs to be generated again. A good example is this map reduce with make here: https://www.benevolent.com/news-and-media/blog-and-videos/ho...


"ripgrep + find -exec jq" what does this imply?


I guess I should have written it more explicitly but my two most common techniques are:

- ripgrep PATTERN | xargs jq

- find PATTERN -exec jq

In both cases you have a large amount of JSON data in your file system and you want to extract a subset of it for further processing. Ripgrep is an extremely fast way to do a content-based search for a pattern, and find is a fast way to do a file-path-based search. Then jq lets you extract the data.

For other types of data I use the same technique of first using ripgrep to find the candidate files and then piping to a text processor like awk/perl/ruby to actually process the data.

If you need FTS rather than regex search then SQLite FTS5 is my go to.

So a full example is something like:

  find . -name '*.pod.json' -print0 \
    | xargs -0 -P 12 -I {} jq -r 'select(.spec.containers != null) | .spec.containers | to_entries[]' {} \
    | jq -s 'sort_by(.image)'

Something like that. It's sort of like a map-reduce: you first narrow the set of inputs by finding files by name pattern, then you pull the relevant data out of each in a parallel xargs, and then you reduce it with jq -s. This technique is used because jq is very slow on large files, and your later processing scripts might be slow too, so the first step of a good pipeline is to throw away all the data you don't need.


If you write a book I will buy it.


You might want to look into using jq --stream together with inputs/truncate_stream/fromstream and friends if you want to use jq with large inputs. Not a speed demon, but it will probably use a lot less memory.
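
For a file whose top level is one big array, the idiom is roughly this (huge.json and the .image field are made up here); it reconstructs one array element at a time instead of loading the whole document:

  jq -cn --stream 'fromstream(1 | truncate_stream(inputs)) | select(.image != null)' huge.json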


Thanks. If anyone else is interested, there is an explanation of this feature here: https://subtxt.in/library-data/2016/03/28/json_stream_jq And: https://github.com/jqlang/jq/wiki/FAQ#streaming-json-parser

The last time I tried, I think the reason I gave up on jq for large inputs was that the throughput would max out at 7 MB/s, whereas the same thing with Spark SQL on the same hardware (a MacBook) would max out at 250 MB/s. So I started looking into other solutions for big data, while I use jq in parallel for small data in multiple files.

I will test it out again because it was 4-5 years ago when I last tried it, but I believe jaq is still preferred for large inputs. Still, for big data I prefer to use Spark/Polars/ClickHouse etc.


Thanks. I use 'rg' a lot, but it finds part of the JSON text; I'm not sure how 'jq' can be used with that, since you're not finding a full JSON object in general.


Oh I see. What I do is use rg -l, aka --files-with-matches, to produce the relevant files, and I use the file path as the key for the next part of the processing pipeline.

So the pattern is to reduce the candidate set of files with rg or find, then extract the relevant parts using xargs jq, then pipe to jq -s to produce the dataset.
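
A sketch of the whole thing, with made-up field names:

  rg -l '"error_code"' logs/ \
    | xargs -I {} jq -c '{file: input_filename, code: .error_code}' {} \
    | jq -s 'group_by(.code) | map({code: .[0].code, count: length})'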


That makes sense, thanks!


What’s the difference between using duckdb and clickhouse? Do you use them together or are they very similar?


I'm relatively new to both and love them both for what they do well. Duck is "in process" like SQLite. Clickhouse is a client/server model like most other databases.

Duck is currently more optimized for the analyst or whoever is working locally, importing and cleaning data - it's incredibly generous and well optimized for importing many different formats and handling all manner of CSV varieties etc. It's fast and fun to work with, but being in-process it's not optimized for, say, setting up remotely as the backend for a BI thing like Metabase and then running typical ETL processes to bring in new data. Only 1 process is allowed to connect at a time, so Metabase would sit on that and then you can't import new data in another process. (There is a read-only connect mode where multiple processes can connect to read, but nothing is allowed to write if something else is connected.)

Clickhouse feels more like a traditional data warehouse solution, but wicked fast. It's a lot more fussy about getting data in - it's not nearly as generous with importing CSV, for example - but once it's in, and in the CH native format, it is jaw-droppingly fast. Its compression schemes are super efficient and it's also quite fun to play with.

I'm on a BigQuery-backed data stack at my job but dream about use cases where we might employ Clickhouse for something more specialized. I use Duck when I'm building an analysis - rather than constantly connecting to BQ, I'll just bring the base data into DuckDB locally and then work from there.

I also use them together on a side project of mine - I eat questionable CSV with Duck, then save it to Parquet which Clickhouse will easily handle.
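
Roughly like this (file names made up); DuckDB slurps the messy CSV and writes Parquet, which ClickHouse can then query or ingest directly:

  echo "COPY (SELECT * FROM read_csv_auto('questionable.csv')) TO 'clean.parquet' (FORMAT PARQUET);" | duckdb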


I'm really excited for MotherDuck, which is basically going to be a BigQuery style serverless data warehouse, but DuckDB flavored, so you can do the local development and then just move it to the cloud without having to translate.


Interesting! I've been thinking about what you are doing but instead using DuckDB and Postgres but maybe I'll give Clickhouse a look instead of Postgres.


Remembering all those unix tools and their uses can be tricky. I wrote a couple of shell scripts that allow you to build command pipelines one step at a time, choosing a tool from a menu at each step, and with the ability to preview the results while tweaking the command line flags at each step. At any point you can go back to the previous step and continue: https://github.com/vapniks/fzf-tool-launcher https://github.com/vapniks/fzfrepl



Thanks! Macroexpanded:

Data Science at the Command Line, 2nd edition (free; 2021) - https://news.ycombinator.com/item?id=30115066 - Jan 2022 (1 comment)

2nd Edition of Data Science at the Command Line Released (Free) - https://news.ycombinator.com/item?id=29589381 - Dec 2021 (1 comment)

Data Science at the Command Line - https://news.ycombinator.com/item?id=16245873 - Jan 2018 (35 comments)

Command-line tools for data science - https://news.ycombinator.com/item?id=6412190 - Sept 2013 (78 comments)


Author here.

Worth pointing out that the discussion from Sept 19 2013 is actually about a blog post, which predates the book. It's because of the amount of feedback it got on HN that I wondered whether this topic could be turned into a book.

Coincidentally, tomorrow I'll be speaking at the New York Open Statistical Programming Meetup [0]. About 11 years ago, I spoke there for the first time. That talk is the reason the aforementioned blog post (and by extension the book) exists.

In short, I owe a lot to that meetup and the HN community. Thanks!

[0] https://nyhackr.org/


I'd like to call out one of my favorite pieces of software from the past 10 years: VisiData [1] has completely changed the way I do ad-hoc data processing, and is now my go-to for pretty much all use cases that I previously used spreadsheets for, and about half of those I previously used local DB instances for.

It's a TUI application, not strictly CLI, but scriptable, and I figure anyone building pipelines using tools like jq, q, awk, grep, etc. to process tabular data will find it extremely useful.

----

[1]: https://visidata.org


Looks like a great resource, bookmarked to read. This reminds me of (and perhaps looks like a bigger, better, more modern version of) another online book I've seen that maybe first introduced me to the idea that serious data analysis can be done using the command line: "Ad Hoc Data Analysis From The Unix Command Line", mostly just covering cat, find, grep, wc, cut, sort, and uniq.

https://en.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_The_...



