Exploratory data analysis for humanities data (awk.dev)
177 points by yarapavan on Oct 6, 2023 | 40 comments



Awk is awesome and Dr. Kernighan has taught me so much.

If you like exploratory data analysis using awk, you may like the "num" command:

https://github.com/numcommand/num

Num uses awk for command line statistics, such as standard deviation, kurtosis, quartiles, uniqueness, ordering, and more. Num runs on a very wide range of Unix systems, such as systems without package managers.

Feature requests and PRs are welcome.


The article nails down a very real pain point with libraries like Pandas:

> looping over a set of input lines seems more natural than the dataframe selectors that Pandas favors

Row-oriented operations, as opposed to aggregations and other OLAP-style queries, are kind of painful. The generator machinery (yield from) is a partial fix for this, but Pandas itself offers little relief.
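To illustrate what I mean (a rough sketch with made-up data, not from the article), the row-wise generator style versus the selector style:

  import pandas as pd

  df = pd.DataFrame({"student": ["ann", "bob", "cai"], "score": [3, 9, 7]})

  # row-oriented: treat the frame like a stream of input lines
  def high_scores(frame, cutoff):
      for row in frame.itertuples(index=False):
          if row.score > cutoff:
              yield row.student

  print(list(high_scores(df, 5)))                 # ['bob', 'cai']

  # the dataframe-selector style that Pandas favors
  print(df.loc[df["score"] > 5, "student"].tolist())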


pandas has a poor API. I'd rather use SQL with DuckDB.


pandas is way more powerful than the way most people use it.

when you have to deal with thousands of text files, a mishmash of csv and tsv, rows that overlap between files, files spread across multiple locations (shared drive, S3 bucket, URL, SQL db, etc), and column names that look similar but are not exactly the same, this is the perfect use case for pandas.

read csv file? just pd.read_csv()

read and concat N csv files? just pd.concat([pd.read_csv(f) for f in glob("*.csv")])

read parquet or read_sql()? not a problem at all.

need custom rules for data cleansing, regex or fuzzy matching on column names, or converting data from/to csv/parquet/sql? it will be a pandas one-liner

a lot of painful data processing/cleaning and correcting is just a one-liner in pandas, and I don't know of a better tool that can beat it - probably tidyr, but that's essentially the same as pandas, just for R
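To make that concrete, a minimal sketch of the messy csv/tsv case (file layout and cleanup rules are made up):

  from glob import glob
  import pandas as pd

  frames = []
  for path in glob("data/*.*sv"):                  # a mix of .csv and .tsv
      sep = "\t" if path.endswith(".tsv") else ","
      df = pd.read_csv(path, sep=sep)
      # normalize column names: lowercase, strip, spaces to underscores
      df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
      frames.append(df)

  # overlapping rows between files are handled by drop_duplicates
  combined = pd.concat(frames, ignore_index=True).drop_duplicates()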


> essentially the same as pandas, just for R

You are aware that pandas was designed to replicate the behavior of base R's data frames?

I've been a heavy user of both and R's data frames are still superior to pandas even without the tidyverse.

Pandas is really nice for the use case it was designed for: working with financial data. This is a big part of why Pandas's indices feel so weird for everything else, but if your index is a time in a financial time series, then all of a sudden Pandas makes sense and works great.

When not working with financial data I try to limit the amount of time my code touches pandas, and increasingly find numpy + regular python works better and is easier to build out larger software with. It also makes it much easier to port your code into another language for use in production (i.e. it's quick and easy to map standard python to language X, but not so much a large amount of non-trivial pandas).
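For example, a toy sketch (made-up prices) of where the time index pays off:

  import pandas as pd

  # a small price series indexed by time; the index-centric API suddenly fits
  prices = pd.Series(
      [101.0, 102.5, 101.8, 103.2],
      index=pd.date_range("2023-01-02 09:30", periods=4, freq="30min"),
  )

  # slice by time and resample to hourly OHLC bars
  print(prices.loc["2023-01-02 09:30":"2023-01-02 10:30"])
  print(prices.resample("60min").ohlc())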


with pandas 2.0 and the Arrow backend instead of numpy, pandas became "cloud data lake native": you can read Arrow files from S3 very efficiently and at any scale, and store/process arbitrarily large numbers of files on cheap serverless infra. The Arrow format is also supported by other languages.

with s3+sqs+lambda+pandas you can build cheap serverless data processing pipelines and iterate extremely quickly
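A minimal sketch of what I mean, assuming pyarrow and s3fs are installed (bucket, key, and column names are made up):

  import pandas as pd

  # read a Parquet mini-batch straight from S3 with the Arrow-backed dtypes
  df = pd.read_parquet(
      "s3://my-bucket/events/2023-10-06/batch-0001.parquet",
      dtype_backend="pyarrow",
  )

  summary = df.groupby("event_type").size()
  summary.to_frame("count").to_parquet("s3://my-bucket/summaries/2023-10-06.parquet")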


Do you have any benchmarks on how much data a given lambda can search/process after loading Arrow data? Not trying to argue, I'm curious because I never thought of this architecture myself; I would have thought that the time it takes to ingest the Arrow data and then search through it would be too long for a lambda, but I may be totally off base here. I've not played around in detail with lambdas, so I don't have a particularly robust mental model of their limitations.


reading/writing Arrow is a zero-serde-overhead operation between memory and disk.

I think of a lambda as a thread: you can put a trigger on an S3 bucket so each incoming file gets processed. This lets you get around the GIL and invoke your lambda for each mini-batch.

assuming you have a high volume and frequency of data, you will need to "cool down" your high-frequency data and switch from a row basis (like millions of rows per second) to a mini-batch basis (like one batch file per 100 MB).

This can be achieved by having Kafka with a high partition count on the ingestion side, with a sink to S3.

for each new file in S3 your lambda will be invoked and the mini-batch will be processed by your python code. You can right-size your lambda's RAM; usually I reserve 2-3x the size of a batch file.

the killer feature is zero ops. Just by tuning your mini-batch size you can regulate how many times your lambda will be invoked.
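A rough sketch of such a handler, assuming the standard S3 event notification shape (bucket layout and the cleanup step are made up; s3fs/pyarrow assumed available):

  import urllib.parse
  import pandas as pd

  def handler(event, context):
      for record in event["Records"]:
          bucket = record["s3"]["bucket"]["name"]
          key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

          # one mini-batch file per invocation
          df = pd.read_parquet(f"s3://{bucket}/{key}")
          df = df.dropna(subset=["user_id"])          # made-up cleanup rule

          df.to_parquet(f"s3://{bucket}/processed/{key.rsplit('/', 1)[-1]}")
      return {"status": "ok"}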


Very cool. Do you then further aggregate and load into a DB or vector store or something?


R also has data.table, which extends data.frame and is pretty powerful and very fast


R + data.table is a lot faster than Base R.

See a benchmark of Base R vs R + data.table (plus various other data wrangling solutions, including our own Easy Data Transform) at:

https://www.easydatatransform.com/data_wrangling_etl_tools.h...


> Pandas is way more powerful

Only if you 1) don't know SQL and 2) are working with tiny datasets that are around 5% of your total RAM.


I guess it depends on who you ask, but personally I can write pandas much faster than I can load data into a DB and then process it. The reason is that pandas' defaults on the from_ and to_ methods are very sane and you don't need to think about things like escaping strings. It's also easy to deal with nulls quickly in pandas and rapidly get some EDA graphs, like in R.
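For example, a quick sketch of that null-handling/EDA flow (file and column names are made up; .hist() assumes matplotlib is installed):

  import pandas as pd

  df = pd.read_csv("survey.csv")

  print(df.isna().sum())                              # nulls per column at a glance
  df["age"] = df["age"].fillna(df["age"].median())
  df = df.dropna(subset=["respondent_id"])

  df.hist(figsize=(10, 6))                            # quick EDA graphs, R-style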

The other benefit of pandas is that it's in python, so you can use your other data analysis libraries, whereas with SQL you need to marshal back and forth between python and SQL.

My usual workflow is: explore the data in pandas/datasette (if it's big data I explore just a sample, using bash tools to pull the sample out) -> write my notebook in pandas -> scale it up in spark/dask/polars depending on the use case.

This is pretty good because ChatGPT understands pandas, pyspark, and SQL really well, so you can easily ask it to translate scripts or give you code for different things.

On scalability, if you need scale there are many options today for processing large datasets with a dataframe API, e.g. Koalas, Polars, Dask, Modin, etc.


>> Only if you 1) don't know SQL and 2) are working with tiny datasets that are around 5% of your total RAM.

this is only true for newbie python devs who learned about pandas from blogs on medium.com. I have pipelines that process terabytes per day in a serverless data lake, and it requires zero of the DBA work that usually comes with anything *SQL


I've processed TBs of CSV files with pandas. You can always read files in chunks, and in the end SQL also needs to read the data from a disk somewhere.
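A minimal sketch of the chunked approach (file and column names are made up):

  import pandas as pd

  totals = {}
  # stream a large CSV without loading it all into RAM
  for chunk in pd.read_csv("huge.csv", chunksize=1_000_000):
      counts = chunk.groupby("country").size()
      for country, n in counts.items():
          totals[country] = totals.get(country, 0) + n

  print(sorted(totals.items(), key=lambda kv: -kv[1])[:10])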



interesting, but I would still prefer pandas for data cleansing/manipulation, just because I won't be limited by SQL syntax and can always use df.apply() and/or any python package for custom processing.
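For example, a small sketch of the df.apply() escape hatch (file and rule are made up):

  import pandas as pd

  df = pd.read_csv("records.csv")

  # arbitrary python logic per row, not limited to SQL expressions
  def classify(row):
      return "long" if len(str(row["title"])) > 40 else "short"

  df["title_class"] = df.apply(classify, axis=1)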

pandas with the Apache Arrow backend is also high performance and compatible with cloud-native data lakes

plus compatibility with the sklearn package is a killer feature: with just a few lines you can bolt an ML model on top of your data
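A minimal sketch of that bolt-on, with made-up file and feature columns:

  import pandas as pd
  from sklearn.model_selection import train_test_split
  from sklearn.linear_model import LogisticRegression

  df = pd.read_parquet("features.parquet")
  X, y = df[["amount", "n_items", "account_age"]], df["churned"]

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
  print(model.score(X_test, y_test))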


It definitely has its place. I like to use it to grab the data, clean it up, and get it out into python / Postgres. I don't like to have it spreading through the codebase.


nobody is denying that pandas is powerful. But its syntax and API use very inconsistent, hard-to-reconcile patterns. It's painful because it's hard to memorize and almost everything has to be looked up.


This has become my workflow too. Admittedly though I've spent most of my career writing large amounts of SQL, and was a pretty heavy Tidyverse user for a while, so that all makes a lot more sense to me than Pandas. I generally get my data into whatever shape I need it in and then load it into pandas.


I'm all for this kind of exploratory hacking around before booting up python/R/Excel/duckdb, especially in constrained environments. A classic pain point is having to deal with column numbers, so I'll share my favorite trick:

`head -n1 /path/to/file.csv | tr ',' '\n' | nl | grep desired_column`

gives you the column number of desired_column


Something to watch out for with nl is that by default it doesn't number empty lines. e.g.:

  $ printf 'one\n\nthree\n' | nl
     1  one

     2  three
Pass -ba (i.e. nl -ba) to number all lines.

For this use case I usually end up running cat -n instead since I find it easier to remember.


yep, without knowing about `nl` I used `...| grep -n column_header` or `...|grep -n .` to replicate the 'nl' behavior.

edit: I like your 'nl' better, as it uses whitespace instead of a colon as the separator.


oh, didn't realise you could do that - I've tended to use `... | awk '{print NR, $0}'` to add line numbers to something


Unless there is a quoted comma or an empty column beforehand (nl “helpfully” skips empty lines for numbering purposes).


grep -n also works in place of `nl`!


I teach text processing to linguistics students who usually have had zero programming experience. I start by introducing regular expressions in the environment of a text editor. The students take quickly to the declarative nature of regexes and become excited by the power of automation.

Moving on to scripting in a language is hard. I would prefer awk, but I usually have to introduce R, because that is what "everyone" is using these days. When faced with the complexity of a full language, most students lose their enthusiasm and never touch programming again after their thesis.

You can accomplish quite a lot with a powerful text editor and just regexes. I think it would be better for most students to just stick with that.


Maybe ETL tools could be a good next step (like FME). You get really high quality input and output drivers. Functional tools for basic data tasks. The data can be easily explored at any stage. And you can drop in regexp or python where it is needed.


Nowadays I would introduce chatgpt as an intermediate layer to use the language. I did some experiments recently with chatgpt4.0 and R and, while not perfect, it did great being able to go from specification to code to interpretation of the results.


Has anyone else noticed the author? This is the "K" in AWK, co-author of The C Programming Language, one of the central figures in UNIX's history, and more, teaching UNIX to non-CS majors. That's pretty cool.


Sorry but presenting awk as a serious alternative to pandas in 2023 to people who aren’t very computing savvy is just mischievous.


Yeah it’s just terrible advice considering what tooling peers will be using in the field.

The author mentions being biased a lot but like… that’s a LOT of bias.


Recent Awk convert (after, like most people, just using it for one-liners for years); it's aged remarkably well (although I wish it used more functional constructs, permitted proper variable initialization, and had interrupt handling... but at that point, it's probably best to switch to a "full" language...)


Wow, imagine being a humanities major and having Brian Kernighan teach you Awk!


No mention of visidata yet?

If you like vi-style interfaces or TUIs and have data to explore, then check it out. (Its native language for manipulating data is Python.)

https://www.visidata.org/


Course website (linked from the article): https://www.hum307.com/


The mighty awk: a great tool, but not the one I start with when teaching people interested in applying computational methods. What I notice is that it moves 'too quickly' from prompt to execution: great for sysadmins, developers, and on-the-job data analysts, but a 'slower' tool is often easier for newcomers to get their heads around, since they often like to see what happens between steps, as well as natural-language error messages when something goes wrong.

KNIME, Orange and GSheets + Apps Script fill this niche to some extent (and I've been wanting to give ENSO Lang a try), enabling rapid prototyping, iteration and visual clues when steps fail.

Then, through repeatedly running the same steps in these environments, it suddenly 'clicks' for some that many of the data preprocessing steps follow a similar pattern, and that much time can be saved using scripting. You have to tease out this eagerness somewhat, by having learners first run through the more arduous process of clicking, dragging, navigating and debugging through a 'slow' interface.

In this regard, some have started to create workflows and test suites around software onboarding, which will yield some valuable insights into what pedagogical strategies work best for different computational skill levels [0][1][2].

[0]: https://jku-vds-lab.at/publications/2022_eurovis_dashboard_o...

[1]: https://jku-vds-lab.at/publications/2022_visinf_compare_eval...

[2]: https://ideah.pubpub.org/pub/yia9z29r/release/1


If I were at Princeton, I would take every one of Kernighan's classes that I could! I wonder if that's a problem there.


I'm at this point 15 years removed, but Prof Kernighan was one of the most accessible professors and taught the most popular CS survey course (333).

At least half a dozen times I was pointed in his direction by another professor. Once, Kernighan spent an hour with me looking into how to scrape a dynamic website for my auction theory project. When he was stumped, he introduced me to a professor at another school who he knew had looked into the topic.


Ah! That is awkward! Sorry, I couldn't resist; I have all respect for Awk.



