
ASK HN: Are there code standards for using Pandas DataFrames in production? - bradmerlin
I'm working on a Python project where some modelling has been implemented using Pandas. I'm helping to add an API over the modelling logic, and when I see a function that accepts a dataframe (sometimes many dataframes), it feels like it's not obvious what that function requires without reading through all of the function's code (e.g. which dataframe columns it requires, maybe even their types, etc.).

Requiring series doesn't seem like the right thing either, because sometimes a function might require a few columns whose rows are related.

Is there an accepted way to define these sorts of functions that lets the caller easily understand what columns (or even types) are required? Or am I missing something obvious and this isn't a real problem?

I can think of a few ways to do it (mostly thinking decorators) but it'd be awesome to hear what people are doing in the real world.
======
redff0000
Sounds like you're mostly interested in provenance and lineage. Pandas doesn't
really help with that and I haven't seen any serious efforts to build on top
of pandas for it.

If you want to roll your own code, you could use decorators, or wrap pandas
methods such as joins, to build a provenance graph.
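
A rough sketch of the decorator route, in case it helps. Nothing here is a pandas feature: `track_lineage`, `LINEAGE` and `add_revenue` are made-up names, and the sketch assumes every argument is a DataFrame passed by keyword. The decorator states the required columns next to the function signature, fails early when one is missing, and appends one record per call that could later be turned into a graph.

```python
import functools

import pandas as pd

# Hypothetical in-memory lineage log: one record per decorated call.
LINEAGE = []


def track_lineage(requires):
    """Declare required columns per DataFrame argument and log each call.

    `requires` maps an argument name to the list of columns it must contain.
    Assumption: all arguments are DataFrames passed as keyword arguments.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**frames):
            for name, columns in requires.items():
                if name not in frames:
                    raise TypeError(f"{func.__name__}: missing argument '{name}'")
                missing = [c for c in columns if c not in frames[name].columns]
                if missing:
                    raise ValueError(
                        f"{func.__name__}: '{name}' is missing columns {missing}"
                    )
            result = func(**frames)
            # Record which columns went in and which came out.
            LINEAGE.append({
                "step": func.__name__,
                "inputs": {name: list(df.columns) for name, df in frames.items()},
                "output": list(result.columns),
            })
            return result

        # Expose the declaration so callers and tooling can inspect it.
        wrapper.requires = requires
        return wrapper
    return decorator


@track_lineage(requires={"orders": ["order_id", "price", "qty"]})
def add_revenue(orders):
    return orders.assign(revenue=orders["price"] * orders["qty"])
```

Calling `add_revenue(orders=df)` then either raises with the list of missing columns or adds an entry to `LINEAGE`, and `add_revenue.requires` tells the caller what is needed without reading the body.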

If you don't want to roll your own code, you can treat the transform as a
black box and define its inputs and outputs in terms of database/store tables.
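
The black-box route could look roughly like this. The table names, columns and the `TableSpec` / `run_step` helpers are all invented for illustration, and dtypes are compared as plain strings to keep it short. The transform itself is never inspected; only the frames crossing its boundary are checked against the declared table layouts.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class TableSpec:
    """Hypothetical description of a table: column name -> pandas dtype string."""
    name: str
    columns: dict

    def validate(self, df):
        for col, dtype in self.columns.items():
            if col not in df.columns:
                raise ValueError(f"{self.name}: missing column '{col}'")
            if str(df[col].dtype) != dtype:
                raise TypeError(
                    f"{self.name}: column '{col}' is {df[col].dtype}, expected {dtype}"
                )
        return df


# Made-up input and output tables for one modelling step.
ORDERS = TableSpec("orders", {"order_id": "int64", "price": "float64"})
ORDER_TOTALS = TableSpec("order_totals", {"order_id": "int64", "total": "float64"})


def run_step(transform, df, input_spec, output_spec):
    """Check the input, run the opaque transform, check the output."""
    return output_spec.validate(transform(input_spec.validate(df)))


def total_per_order(orders):
    # The black box: callers only need the two specs above to use it.
    totals = orders.groupby("order_id", as_index=False)["price"].sum()
    return totals.rename(columns={"price": "total"})
```

The specs then double as documentation for the API layer: a caller reads `ORDERS` and `ORDER_TOTALS` rather than the body of `total_per_order`.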

