The article nails down a very real pain point with libraries like Pandas:
> looping over a set of input lines seems more natural than the dataframe selectors that Pandas favors
Row-oriented operations, as opposed to aggregations and other OLAP-style queries, are kind of painful. Python's generator machinery (yield from) is a partial fix, but Pandas itself offers little relief.
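For illustration, here is a minimal sketch of that contrast, with made-up records and column names: a plain generator pipeline over rows versus the equivalent dataframe-selector version.

```python
import pandas as pd

records = [{"id": 1, "amount": "3.5"}, {"id": 2, "amount": None}]  # made-up data

def clean_rows(rows):
    """Row-oriented style: loop over input records one at a time."""
    for row in rows:
        if row["amount"] is None:
            continue
        yield {**row, "amount": float(row["amount"])}

cleaned = list(clean_rows(records))

# The same logic in the selector style that pandas favors.
df = pd.DataFrame(records)
df = df[df["amount"].notna()].assign(amount=lambda d: d["amount"].astype(float))
```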
pandas is way more powerful than the way most people use it.
When you have to deal with thousands of text files, a mishmash of CSV and TSV, rows that overlap between files, files spread across different locations (shared drive, S3 bucket, URL, SQL DB, etc.), and column names that look similar but not quite identical, that is a perfect use case for pandas.
read csv file? just pd.read_csv()
read and concat N csv files? just pd.concat([pd.read_csv(f) for f in glob("*.csv")])
read parquet or read_sql()? not a problem at all.
Need some custom rules for data cleansing, regex matching or fuzzy matching on column names, or converting data between CSV/Parquet/SQL? It will be a pandas one-liner.
A lot of painful data processing, cleaning, and correcting is just a one-liner in pandas (a few are sketched below), and I don't know of a better tool that beats it; probably tidyr, but that is essentially the same as pandas, just for R.
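A rough sketch of those one-liners; the file names, connection string, and column-name cleanup rule are all hypothetical:

```python
import glob
import pandas as pd

# Read a CSV file.
df = pd.read_csv("data.csv")

# Read and concatenate N CSV files.
df_all = pd.concat([pd.read_csv(f) for f in glob.glob("*.csv")], ignore_index=True)

# Parquet and SQL are just as direct (paths/connections are placeholders).
# df_pq  = pd.read_parquet("data.parquet")
# df_sql = pd.read_sql("SELECT * FROM events", "sqlite:///example.db")

# A typical cleansing one-liner: normalize column names that look similar but not identical.
df_all = df_all.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
```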
You are aware that pandas was designed to replicate the behavior of base R's data frames?
I've been a heavy user of both and R's data frames are still superior to pandas even without the tidyverse.
Pandas is really nice for the use case it was designed for: working with financial data. This is a big part of why Pandas's indices feel so weird for everything else, but if your index is a timestamp in a financial time series, then all of a sudden Pandas makes sense and works great.
When not working with financial data I try to limit the amount of time my code touches pandas, and increasingly find numpy + regular python works better and is easier to build out larger software with. It also makes it much easier to port your code into another language for use in production (i.e. it's quick and easy to map standard python to language X, but not so much a large amount of non-trivial pandas).
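As a small, hedged illustration of why the index model clicks for time series (the prices here are invented):

```python
import numpy as np
import pandas as pd

# Invented daily closing prices, just to show the DatetimeIndex ergonomics.
idx = pd.date_range("2024-01-01", periods=250, freq="B")  # business days
close = pd.Series(100 + np.random.randn(250).cumsum(), index=idx, name="close")

january = close.loc["2024-01"]         # partial-string date slicing
monthly = close.resample("ME").last()  # calendar-aware resampling ("M" on older pandas)
returns = monthly.pct_change()
```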
With pandas 2.0 using the Arrow backend instead of NumPy, pandas became "cloud data lake native": you can read Arrow files from S3 very efficiently and at any scale, and store/process arbitrarily large numbers of files on cheap serverless infra. The Arrow format is also supported by other languages.
With S3 + SQS + Lambda + pandas you can build cheap serverless data processing pipelines and iterate extremely quickly.
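Something along these lines, assuming pandas >= 2.0 with pyarrow and s3fs installed; the bucket and key are placeholders:

```python
import pandas as pd

# Read a Parquet/Arrow file straight from S3 into Arrow-backed dtypes.
df = pd.read_parquet(
    "s3://my-bucket/events/date=2024-01-01/part-000.parquet",  # placeholder path
    dtype_backend="pyarrow",
)

# Writing back is symmetric.
df.to_parquet("s3://my-bucket/processed/part-000.parquet", index=False)
```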
Do you have any benchmarks on how much data a given Lambda can search/process after loading Arrow data? Not trying to argue; I'm curious because I never thought of this architecture myself. I would think that the time it takes to ingest the Arrow data and then search through it would be too long for a Lambda, but I may be totally off base here. I've not played around with Lambdas in detail, so I don't have a particularly robust mental model of their limitations.
Reading/writing Arrow is a zero-serde-overhead operation between memory and disk.
I think of a Lambda as a thread: you can put a trigger on an S3 bucket so each incoming file gets processed. This lets you sidestep the GIL and invoke your Lambda once per mini-batch.
Assuming you have high data volume and frequency, you will need to "cool down" the high-frequency stream and switch from a row basis (millions of rows per second) to a mini-batch basis (e.g. one batch file per ~100 MB).
This can be achieved with Kafka and a high partition count on the ingestion side, sinking to S3.
From S3, your Lambda is invoked for each new file and the mini-batch is processed by your Python code. You can right-size the Lambda's RAM; I usually reserve 2-3x the size of a batch file.
The killer feature is zero ops. Just by tuning your mini-batch size you can regulate how often your Lambda is invoked (a rough handler sketch is below).
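A rough sketch of what such a per-file Lambda handler might look like; the bucket layout, file format, and transform step are placeholders, not the commenter's actual code:

```python
import urllib.parse

import pandas as pd

def transform(df):
    # Placeholder for the real cleansing/processing logic.
    return df.drop_duplicates()

def handler(event, context):
    """Triggered by S3 ObjectCreated events; each invocation handles one mini-batch file."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # One mini-batch per invocation; size the function's RAM at roughly 2-3x the file size.
        df = pd.read_parquet(f"s3://{bucket}/{key}", dtype_backend="pyarrow")
        out = transform(df)
        out.to_parquet(f"s3://{bucket}/processed/{key.rsplit('/', 1)[-1]}", index=False)
```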
I guess it depends on who you ask, but personally I can write pandas much faster than loading data into a DB and then processing it. The reason is that pandas' defaults for its read_*/to_* functions are very sane and you don't need to think about things like escaping strings. It's also easy to deal with nulls quickly in pandas and rapidly get some EDA graphs, like in R.
The other benefit of pandas is it’s in python so you can use your other data analysis libraries whereas with SQL you need to marshal back and forth between python and SQL.
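For instance, the null handling and quick EDA mentioned above might look like this; the file and column names are made up, and the histogram needs matplotlib installed:

```python
import pandas as pd

df = pd.read_csv("survey.csv")                    # hypothetical file

print(df.isna().sum())                            # nulls per column
df["age"] = df["age"].fillna(df["age"].median())  # impute a numeric column
df = df.dropna(subset=["income"])                 # drop rows missing a key field

df.describe()                                     # quick summary stats
df.hist(figsize=(10, 6))                          # fast distribution plots
```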
My usual workflow is:
Explore the data in pandas/datasette (if it's big data, I explore just a sample and use bash tools to pull the sample out) -> write my notebook in pandas -> scale it up in Spark/Dask/Polars depending on the use case.
This is pretty good because ChatGPT understands pandas, PySpark, and SQL really well, so you can easily ask it to translate scripts or give you code for different things.
On scalability: if you need scale, there are many options today for processing large datasets with a dataframe API, e.g. Koalas, Polars, Dask, Modin, etc.
>>Only if you 1) don't know SQL and 2) working with tiny datasets that are around 5% of your total RAM.
This is true only for newbie Python devs who learned about pandas from blogs on medium.com. I have pipelines that process terabytes per day in a serverless data lake, and it requires zero of the DBA work that usually comes with anything *SQL.
I've processed TBs of CSV files with pandas. You can always read files in chunks, and in the end SQL also needs to read the data from disk somewhere.
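For example, a chunked pass over a large file might look like this (the file and column names are hypothetical):

```python
import pandas as pd

totals = {}

# Stream the file in 1M-row chunks instead of loading it all at once.
for chunk in pd.read_csv("huge.csv", chunksize=1_000_000):
    for country, n in chunk["country"].value_counts().items():
        totals[country] = totals.get(country, 0) + int(n)

print(pd.Series(totals).sort_values(ascending=False).head())
```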
Interesting, but I would still prefer pandas for data cleansing/manipulation, just because I won't be limited by SQL syntax and can always use df.apply() and/or any Python package for custom processing.
Using the Apache Arrow backend also makes pandas performant and compatible with cloud-native data lakes.
Plus, compatibility with the sklearn package is a killer feature: with just a few lines you can bolt an ML model on top of your data (sketched below).
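A hedged sketch of that combination; the dataset, columns, and model choice are illustrative, and scikit-learn is assumed to be installed:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")                          # hypothetical dataset

# Custom cleansing that would be awkward in SQL: arbitrary Python per value.
df["plan"] = df["plan"].apply(lambda s: str(s).strip().lower())

# A few lines to bolt an ML model on top of the cleaned frame.
X = pd.get_dummies(df[["plan", "tenure_months", "monthly_spend"]])
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```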
It definitely has its place. I like to use it to grab the data, clean it up, and get it out into Python / Postgres. I don't like to have it spreading through the codebase.
Nobody is denying that pandas is powerful. But its syntax and API use very inconsistent, hard-to-reconcile patterns. It's painful because it's hard to memorize, and almost everything has to be looked up.
This has become my workflow too. Admittedly though I've spent most of my career writing large amounts of SQL, and was a pretty heavy Tidyverse user for a while, so that all makes a lot more sense to me than Pandas. I generally get my data into whatever shape I need it in and then load it into pandas.
I'm all for this kind of exploratory hacking around before booting up python/R/Excel/duckdb, especially in constrained environments. A classic pain point is having to deal with column numbers, so I'll share my favorite trick:
I teach text processing to linguistics students who usually have had zero programming experience. I start by introducing regular expressions in the environment of a text editor. The students take quickly to the declarative nature of regexes and become excited by the power of automation.
Moving on to scripting in a language is hard. I would prefer awk, but I usually have to introduce R, because that is what "everyone" is using these days. When faced with the complexity of a full language, most students lose their enthusiasm and never touch programming again after their thesis.
You can accomplish quite a lot with a powerful text editor and just regexes. I think it would be better for most students to just stick with that.
Maybe ETL tools could be a good next step (like FME). You get really high quality input and output drivers. Functional tools for basic data tasks. The data can be easily explored at any stage. And you can drop in regexp or python where it is needed.
Nowadays I would introduce ChatGPT as an intermediate layer for using the language. I did some experiments recently with GPT-4 and R and, while not perfect, it did a great job going from specification to code to interpretation of the results.
Are others noticing the author? This is the co-author of The C Programming Language, the "K" in AWK, a key contributor to early UNIX, and more, teaching UNIX to non-CS majors. That's pretty cool.
Recent Awk convert (after, like most people, just using it for one-liners for years); it's aged remarkably well (although I wish it used more functional constructs, permitted proper variable initialization, and had interrupt handling... but at that point, it's probably best to switch to a "full" language...)
The mighty awk: a great tool, but not the one I start with teaching people interested in applying computational methods. What I notice is that it moves 'too quick' on prompt execution--great for sysadmins, developers and on the job data analysts, but a 'slower' tool is often easier for newcomers to get their heads around, who'd often like to see what happens between steps as well as natural language error messages when something goes wrong.
KNIME, Orange and GSheets + Apps Script fill this niche to some extent (and I've been wanting to give ENSO Lang a try), enabling rapid prototyping, iteration and visual clues when steps fail.
Then, through repeatedly running the same steps in these environments, it suddenly 'clicks' for some that many of the data preprocessing steps follow a similar pattern, and that much time can be saved using scripting. You have to tease out this eagerness somewhat, by having learners first run through the more arduous process of clicking, dragging, navigating and debugging through a 'slow' interface.
In this regard, some have started to create workflows and test suites around software onboarding, which will yield some valuable insights into what pedagogical strategies work best for different computational skill levels [0][1][2].
I'm at this point 15 years removed, but Prof Kernighan was one of the most accessible professors and taught the most popular CS survey course (333).
At least half a dozen times I was pointed in his direction by other professors; once Kernighan spent an hour with me looking into how to scrape a dynamic website for my auction theory project. When he was stumped, he introduced me to a professor at another school who he knew had looked into the topic.
If you like exploratory data analysis using awk, you may like the "num" command:
https://github.com/numcommand/num
Num uses awk for command-line statistics, such as standard deviation, kurtosis, quartiles, uniqueness, ordering, and more. Num runs on a very wide range of Unix systems, including systems without package managers.
Feature requests and PRs are welcome.