
Is a Dataframe Just a Table? (2019) [pdf] - luu
https://plateau-workshop.org/assets/papers-2019/10.pdf
======
leto_ii
> _Having many different ways to express the same logic makes it hard for
> developers to understand programs of heterogeneous styles. Besides having
> varying ways to express the same simple logic, the sheer number of APIs ( >
> 200) that are not only overloaded but also have default parameters that may
> change version to version, making it hard to remember the APIs._

It's a bit tangential to the main point, but I do agree with this remark. I
have always found Pandas uncomfortable to work with. I'm never sure if I'm
doing things in the most efficient/idiomatic way and I've found it hard to be
consistent over time, especially since I've picked up different bits of code
from different places.

I've gotten a lot more efficiency out of R, especially the data.table package.

~~~
closed
I just finished a lengthy analysis of why pandas groupby operations end up
harder to use than R's dplyr or data.table.

For example, a grouped filter is very cumbersome in pandas.

Interested to hear if you think it gets at the heart of the problem.

[https://mchow.com/posts/2020-02-11-dplyr-in-python/](https://mchow.com/posts/2020-02-11-dplyr-in-python/)
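
For readers who haven't hit this: a grouped filter in pandas takes a detour
through transform. A minimal sketch, with made-up column names:

```python
import pandas as pd

# toy data; "class" and "score" are made-up names for illustration
df = pd.DataFrame({"class": ["a", "a", "b", "b"],
                   "score": [50, 70, 90, 95]})

# grouped filter: keep only rows from classes whose mean score exceeds 80
kept = df[df.groupby("class")["score"].transform("mean") > 80]
```

The dplyr equivalent is roughly `df %>% group_by(class) %>% filter(mean(score) > 80)`,
which states the intent far more directly.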

~~~
wodenokoto
> result length dependent operations: to calculate a mean in this case we have
> to pass the string “mean” to transform. This tells pandas that the result
> should be the same length as the original data.
    
    
        g_students.score.mean()
    

looks interchangeable with `g_students.score.transform('mean')`, but the two
return different values (and, in general, results of different lengths)!

I think that is a great point to add to your very interesting article. I
wouldn't know which of the two operations is correct to use, and I would not
notice anything wrong or odd with either method in a code review, so this is
ripe for introducing wrong results in a production environment.
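
For anyone following along, a toy sketch of the two calls (made-up data): in
general `.mean()` aggregates down to one row per group, while
`.transform('mean')` broadcasts the group mean back onto the original rows, so
confusing the two silently changes both the shape and the values.

```python
import pandas as pd

students = pd.DataFrame({"class": ["a", "a", "b"],
                         "score": [1.0, 2.0, 4.0]})
g_students = students.groupby("class")

agg = g_students["score"].mean()                   # one value per group
broadcast = g_students["score"].transform("mean")  # one value per original row
```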

------
roenxi
Tables and data frames are both leaves of a far more fundamental idea that
beginners don't pick up because it is too simple - the relational model of
data. The real problem is that the basic normal forms are so obvious and simple
it is difficult to tell whether people designed around them on purpose or
stumbled onto the right path.

I suspect the distinction between tables and data frames (and arguing about
query languages) is from people who have lost sight of the important things
when dealing with data - the relational model and a system that supports
relational algebra. Beyond that, features and optimisations need to be
justified in terms of today's (and tomorrow's) needs which are specific to the
data. The only major issue with SQL is a lack of support for something like
R's tidyverse gather() and spread() operations, which could be considered
missing operations in relational algebra.
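
For context, gather() and spread() (now pivot_longer() and pivot_wider()) are
the long/wide reshape operations; outside SQL they're routine one-liners, e.g.
in pandas (a toy sketch with made-up data):

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "math": [90, 80], "art": [70, 60]})

# gather() / pivot_longer(): wide -> long
long_df = wide.melt(id_vars="id", var_name="subject", value_name="score")

# spread() / pivot_wider(): long -> wide again
wide_again = long_df.pivot(index="id", columns="subject",
                           values="score").reset_index()
```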

~~~
js8
I think one limitation of what you're saying is that you need to process the
data before they fit the relational model. You need to integrate and normalize,
and therefore, at some point, you need to support datasets that are not in
normal form. Or perhaps you don't even need to normalize.

But I also agree with you. I think we are teaching programming wrong: we start
with imperative programming, but perhaps we should start with data modelling.
What do the data look like? What are the constraints? Can we test the
constraints? How many bits do I need to store them?

"Show me your flowcharts and conceal your tables, and I shall continue to be
mystified. Show me your tables, and I won't usually need your flowcharts;
they'll be obvious." -- Fred Brooks

~~~
ak39
But that was exactly how programming was taught 30 years ago. Relational data
modeling was your starting point. Then object orientation came along and we
suffered the dreaded object relational impedance mismatch when we couldn’t
translate invoice header to invoice line-items in an object oriented way. Yet
somehow the majority of development continued down this design philosophy,
abstracting the data model further and further away (it was only heresy for a
short while that data models were being _generated_ by tools of process
models). Then NoSQL. Then GraphQL dragging the lamppost of software
development further away from understanding data first.

The central design premise for relational data modeling was “if your model
could potentially allow inconsistencies, assume the inconsistencies”. Today
that premise is easily brushed aside with “you’ll never get anyone
manipulating this table without going through this layer” or “we then group by
these columns to show unique records to the user”.

~~~
AmericanChopper
I think NoSQL is only as popular as it is because a lot of engineers don't have a
proper understanding of the relational model (which isn’t hard to understand
at all, it just seems to have fallen out of popularity). I certainly think
there are completely valid use cases for denormalized datastores, but I don’t
think those use cases are what’s driving their adoption. You can see this in
how so many mongo/dynamo... apps end up just being semi-normalized,
relational-ish databases.

I also think another reason is that relational database interfaces don’t
really fit in with the architecture people want to use these days. Products
like dynamo have secure and operable HTTP interfaces and SDKs that fit in
really nicely with the ‘serverless’ stuff. To run a relational database you
pretty much need to run a network, which isn’t particularly compatible with
such architectures.

~~~
synthc
I think a factor in this is that NoSQL databases have nice APIs that
programmers can use to set up tables, run simple queries, etc., which makes it
much easier to get started. With an RDBMS you have to muck around with
connections and SQL, which is more powerful but requires much more ceremony
(connection pooling, prepared statements, etc.).

The lack of understanding of the relational model is not the limiting factor
in my experience, the developer experience is just much worse.

~~~
collyw
What you say is true, but it can be solved by an ORM, though that adds an
extra layer of complexity.

~~~
aeternum
It's not just the ORM. Many NoSQL databases offer real-time events (query
subscriptions), a simple security model, built-in data versioning, and
built-in sharding. You rarely get any of that out of the box with a
traditional RDBMS.

------
tanilama
It is worse than a table.

Dataframes don't really have a clear boundary around what they can and can't
do. There's nothing stopping you from shoehorning control-flow logic into one,
which makes a dataframe pretty much an executor for an arbitrary computation
graph specified in a DSL, whose behaviour depends on however the developer
decided to implement it.

I'd rather take SQL because I have a better understanding of what it is doing.

~~~
kortex
Can you drop nans, parse strings to floats, apply arbitrary lambdas over a
rolling window, take the cosine of those values, and plot a graph against
time, in a single line of sane SQL?

Easy in Pandas.

Dataframes are not tables; tables are not dataframes. It's nearly as apples-
to-bananas as comparing Python lists to C arrays.
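
The chain is real in pandas; a sketch with toy data (the plotting step is left
as a comment so the snippet runs headless):

```python
import numpy as np
import pandas as pd

# toy time series of numeric strings with a missing value
s = pd.Series(["1.0", "2.5", None, "4.0"],
              index=pd.date_range("2020-01-01", periods=4, freq="D"))

# drop NaNs, parse strings to floats, apply a lambda over a rolling window,
# take the cosine; tacking .plot() on the end would graph it against time
out = np.cos(s.dropna().astype(float)
              .rolling(2).apply(lambda w: w.mean(), raw=True))
```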

~~~
somurzakov
Everything you mentioned can be done easily with a database schema. Window
functions work well in SQL. Plots are easily done in any BI solution that
hooks up to any database.

Pandas is just a poor man's SQL+BI. Pandas stores everything in memory and has
many limitations.

In SQL Server I can easily churn through a terabyte-sized database and get the
data I need, because the schema is well designed with partitioned tables and
clustered indexes, and a well-designed SQL query takes less than a second to
run against a >1TB database. It even allows a team of more than 20 people to
work with the same SQL and query it simultaneously.

I would love to see how you can analyze a 1TB csv file, or a pile of 100 csv
files totalling over 10TB, where you have yet to discover the schema and how
the tables join together. And I am doing it on a simple $700 workstation, not
even using any Hadoop nonsense.

~~~
ezzzzz
How about working with poorly designed schemas? I work with SQL Server as
well, dealing with legacy data designed around imperative T-SQL programming.
Our 'BI solution', SSRS, crawls on pretty simple queries, where 'hacks' need
to be done: joining a table on itself, all kinds of dirty tricks...

I don't know... I honestly feel like 'BI solutions' are a poor person's Python
if you are doing anything more than simple dashboards. Something that can be
done in 2 lines of code in a notebook requires endless fiddling in an IDE, to
produce something not easily reproducible.

Aside: I've no experience with Tableau or Power BI, I just know Crystal
Reports and SSRS, which are pretty painful.

~~~
somurzakov
It's hard dealing with legacy stuff. One alternative I can propose: pitch
your management and get yourself a separate, up-to-date SQL instance just
for analytics. The easiest option is to install SQL Server Developer edition,
which is free.

Cherry-pick what you need, ETL your data out of the legacy systems into your
warehouse, run something like Tableau/Looker/Power BI on top, and you will be
amazed how effective you can be.

~~~
ezzzzz
Agreed, but again: if I'm ETLing my data into a warehouse, then as a developer
(not a BI person) I'm reaching for Spark, Flink, or whatever to roll my own
analytics, and Python/Flask/D3 for building web dashboards.

Then, once you have 'insight' into your data, you can easily 'do' something
with it without the limitations of a tiered product.

------
js8
> What should we make of it?

People are idiots! (I am sorry to say that, I don't really mean it, I
empathize, everyone sometimes is.)

Yes, a dataframe is pretty much just a table. (And yes, GraphQL is a poor
reinvention of SQL.) However, to be fair, there are different considerations.
A database needs to know things like storage constraints and foreign keys (so
you have many different column types); when you're doing just analytics (i.e.
pandas), you pretty much only need two types - number and (trimmed) string
(and sometimes a datetime, but that's just a conveniently formatted number).
(I think SAS got that right.)

Anyway, I think the way out of this mess is a functional, Haskell-like
language for processing tables (not necessarily Turing complete) that would
subsume the declarativeness of SQL and could be compiled to different targets.
The language would basically specify some (limited, so recursion would not be
allowed) processing of tables of known types, and you could then compose these
processing pieces, or integrate them with custom processing in another
language.

I understand why people hate SQL; it is on some level hard to write and
compose. I think a correctly designed functional language would help here. Or
maybe just tear LINQ out of the .NET ecosystem.
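
Something in this spirit can already be approximated with plain functions and
pandas' .pipe; a sketch, with made-up helper names:

```python
import pandas as pd

# hypothetical, composable table transforms written as plain functions
def only_passing(df, cutoff=60):
    return df[df["score"] >= cutoff]

def mean_by(df, key):
    return df.groupby(key, as_index=False)["score"].mean()

df = pd.DataFrame({"class": ["a", "a", "b"], "score": [50, 80, 90]})

# the pieces compose left-to-right, a little like a query plan
result = df.pipe(only_passing).pipe(mean_by, "class")
```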

~~~
CodesInChaos
In the re-inventing SQL department, I'd take a look at EdgeQL/EdgeDB. It's not
perfect, but much closer to a functional language and composes well.

I would not consider GraphQL a poor reinvention of SQL: its niche, decoupling
and simplifying untrusted high-latency clients, is too different from flexible
queries created by a trusted server. It competes with REST and RPC, not SQL.

GraphQL's native operations are limited to following a foreign-key link, which
has predictable performance (quasi-linear in the number of objects in the
response), and selecting a subset of fields (reducing the response size and
enabling field-level deprecation and usage tracking). These limitations
prevent both malicious clients and less performance-minded front-end
developers from putting excessive load on the database. They also allow
GraphQL to work with both data stored in a database and data generated by an
application, whereas supporting SQL pretty much limits you to data processed
by a database.

~~~
js8
The way I see it, SQL and GraphQL are solving somewhat complementary problems.
In SQL, I have a structure (all these tables that possibly have to be joined)
in the database and I want to pick something out as a simple result table. In
GraphQL, I create the more complex structure on the output.

But I do consider GraphQL somewhat unnecessary, because if those REST APIs
composed just like tables do in the database, then you wouldn't need GraphQL,
and you could run a normal query. (There is also a problem of externalities,
putting the processing costs on the client is cheaper.)

And thanks for pointing out EdgeDB.

~~~
imtringued
>you wouldn't need GraphQL, and you could run a normal query.

You're making the assumption that there is something to run a normal query on.
As soon as you write even a single line of server side code this assumption is
broken. What if the GraphQL query doesn't actually use an SQL database and
just reads something from a file or another service? What if the server is
responsible for granting row level access under very complicated rules that
cannot be implemented with just SQL and would be completely insecure if it was
done on the client? What if you actually need to do things like implement
business logic?

What you're talking about is downright nonsensical within that context.

~~~
CuriouslyC
These counterpoints are probably valid for most database systems, but with
Postgres it's actually far more efficient to use it as the substrate in which
everything else is embedded.

* Postgres has a robust, battle tested role based security model with inheritance.

* Postgres has foreign data wrappers that let you encapsulate external resources as tables that behave the same way as local tables for most use cases.

* Postgres has plugins for most of the popular programming languages.

If you really like GraphQL, the Postgres approach can still give you that too,
using Hasura or PostGraphile.

------
SatvikBeri
Summary: no, because row order matters in dataframes. That's why matrix
operations are a better computational framework for dataframes than relational
algebra.

~~~
gpderetta
That doesn't seem like a very defining characteristic. Row order/index is just
an implicit primary-key column.

~~~
SatvikBeri
The point is which operations you optimize for. Most SQL databases aren't
well-optimized for matrix multiplication, but are great with search-style
queries, and the opposite for dataframes.

------
gowld
The article is a nice comparison of the pros and cons of Pandas and SQL, but
the title and ensuing comments are misleading and off the point, akin to
favorite arguments like "are Haskell types just C++ objects?" They have
similarities and differences, in core semantics and in ergonomics.

Pandas has a perlish API designed for abbreviating common tasks, combined with
a Pythonic (per common practice, despite being against Putin's stated
principles) disregard for informative use of types, instead of a simple,
coherent, principled, layered API. That makes it look like "SQL done wrong"
because the semantic differences aren't laid out explicitly.

~~~
Iwan-Zotow
> being against Putin's stated principles

walk carefully

------
playing_colours
A dataframe may be a table, but it can also be a matrix by supporting some
matrix operations.

~~~
gpderetta
yes, the problem is, it is not a very good table.

------
dwenzek
Reading sections 2.2.3 (about joins) and 3.5 (about code comprehension) took
me aback! I came in with a positive prior about dataframes, but changed my
mind. SQL is far from perfect, but at least it has sound foundations!

------
wodenokoto
I liked the article, although it kind of conflates SQL with databases and
Pandas with dataframes.

It does acknowledge other dataframe implementations (base R and Spark), but
most of the arguments about what defines a dataframe are taken from the Pandas
API.

------
commandlinefan
Man, I wish there were more honest comparisons of approaches like this one
rather than the "shut up, stupid" stackoverflow example he references at the
start.

~~~
eoakes
she*

[http://yifanwu.net/](http://yifanwu.net/)

------
kyberias
I was unable to learn R because I couldn't understand what a dataframe is. It
was irritating that it wasn't defined clearly and there seemed to be no
connection to terminology that was familiar to me (relational databases, SQL
tables etc.).

~~~
rankam
There's no other way to say this without sounding rude, but you weren't unable
to learn R because you couldn't understand what a dataframe was - you were
unable to learn R because you gave up. Blaming a data structure for the
failure seems like a bit of a stretch.

~~~
kyberias
That is what I said. I didn't understand what a dataframe was. Note that I
didn't really blame the data structure. I said it was an irritating factor.
Someone else might not have been as irritated. But if R was closer to what I
already knew, learning would have been easier. For me.

------
sammycdubs
Article aside - that LaTeX theme is dope

------
chenster
Yes

------
tsbinz
This doesn't seem to be from 2016 (some identifiers suggest that it is, but it
cites papers up to 2018 and it says that a tweet from 2016 is two years old).

~~~
MrManatee
Indeed. "Conference on Very Important Topics 2016" is not a real conference,
but a placeholder from a template. Maybe it was left behind by accident? The
paper is from the PLATEAU Workshop 2019.

~~~
dang
Ok, we've added three years to the title above. Thanks to both of you!

------
idclip
“Now GraphQL has almost 15K GitHub stars and a large and active developer
community. What should we make of it?“

Don't know why, but that made me smile. Didn't know the DB world is so edgy -
it's cute.

I say bygones about GraphQL; usability always trumps fad in the end and
"nature will decide" - nature being the masses and hordes of "us over time".
Does GraphQL do warehousing well?

Well, I'm not militant and think it's cool.

One could say, based on my experience at uni with random-access-machine
calculations by hand, that everything a computer does reduces to a table.
Anyone with examples of things that aren't essentially tables?

~~~
js8
> Anyone with examples of things that aren't essentially tables?

Well, my definition of a table: a fixed number of columns of possibly
heterogeneous types; a variable number of rows that all have the same type.

So, for starters, a matrix is not a table. Nor is a list, a tree, or a hash
table (KV store).

~~~
em500
> Variable number of rows that have all the same type.

Why do they need to have the same type? In sqlite a field can have a different
type in every record/row
([https://www.sqlite.org/datatype3.html](https://www.sqlite.org/datatype3.html)).
Is having fixed-typed fields fundamental to the concept of a table, or just a
property of most SQL implementations?
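
This is easy to see with Python's stdlib sqlite3 module; a minimal sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x)")  # no declared type on the column
con.execute("INSERT INTO t VALUES (1), ('one'), (1.0), (NULL)")

# each stored value keeps its own type, independent of the column
types = [type(v).__name__ for (v,) in con.execute("SELECT x FROM t")]
```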

~~~
nnq
How do you _aggregate_ without field types? You'd end up writing custom code
for untyped map-reduce...

How can your code know what data to expect without typed fields? Your code
size can grow even 10x if you need to assume that any record can have any
shape...

You could have _more complex and user-defined types_ in an ideal super-SQL,
like "int or map: string -> bool" etc., but you WANT types. They reduce
complexity at all levels! You want them in your code, you want them in your
data!

Sure, go ahead, just use MongoDB with no constraints on collection
structure... You'll end up with 5x more code in your application and 5x more
bugs.

SQL and the relational model help you _think more_, code _less_ (and more
slowly - but that is an advantage) and _have fewer bugs with less effort!_ If
you're lazy and prefer _thinking deeper_ to _"working harder"_, you'll always
prefer properly typed SQL :P (Unless you're working with stuff that's
naturally shaped as graphs or tensors.)

~~~
em500
None of your remarks come close to answering my question (is having a type
definition in the field/schema fundamental to a table, or just a way that most
SQL databases decided to implement them)?

Regarding your questions: in the case of sqlite3, a _value_ is typed, but a
_field_ is not. sqlite3 only has a very small number of types (numbers, text
and nulls), and functions (both aggregate and non-aggregate) that expect
either numbers or text do standard SQL casts before evaluation.

In practice the type systems of all common SQL implementations are so weak
that their "guarantees" are worth very little to the user/programmer anyway.
(I have a hunch that they're really mostly for internal optimizations.) All
your code needs to handle nulls anyway (and if you only use non-nullable
fields, users are just going to use their own placeholders/ad-hoc sentinel
values).

Back to the discussion: sqlite3 doesn't impose any type homogeneity on
fields/columns, while Pandas / R dataframes do. So why should that be a
characteristic of a "table", let alone one that distinguishes a "table" from a
"dataframe"?

~~~
shawnz
Who's to say who gets to write the formal definition? And just because the
formal definition claims or doesn't claim something, why does that mean every
single implementation has to behave exactly that way?

------
r-zip
This guy's writing is really inconsistent. "v." and "vs.", plus
singular/plural disagreements all over the place. I think the CS/ML/EE fields
need to raise their standards w.r.t. editing. Interesting paper though.

