Hacker News new | past | comments | ask | show | jobs | submit login
Is a Dataframe Just a Table? (2019) [pdf] (plateau-workshop.org)
193 points by luu 17 days ago | hide | past | web | favorite | 113 comments



> Having many different ways to express the same logic makes it hard for developers to understand programs of heterogeneous styles. Besides having varying ways to express the same simple logic, the sheer number of APIs (> 200) that are not only overloaded but also have default parameters that may change version to version, making it hard to remember the APIs.

It's a bit tangential to the main point, but I do agree with this remark. I have always found Pandas uncomfortable to work with. I'm never sure if I'm doing things in the most efficient/idiomatic way and I've found it hard to be consistent over time, especially since I've picked up different bits of code from different places.

I've gotten a lot more efficiency out of R, especially the data.table package.


I feel like I'm constantly looking up SO or blog posts that benchmark Pandas methods while I'm coding with Pandas. You have to, since the inefficiency you add with a slower method is nontrivial.


I just finished a lengthy analysis of why pandas groupby operations ends up harder to use than R's dplyr or data.table.

For example, a grouped filter is very cumbersome in pandas.

Interested to hear if you think it gets at the heart of the problem.

https://mchow.com/posts/2020-02-11-dplyr-in-python/


> result length dependent operations: to calculate a mean in this case we have to pass the string “mean” to transform. This tells pandas that the result should be the same length as the original data.

    g_students.score.mean()
has the same length as using `g_students.score.transform('mean')` but the result has different values!

I think that is a great point to add to you very interesting article. I wouldn't know which of the two operations is correct to use, and I would not notice anything wrong, or odd with either method in a code review, so this is ripe for adding wrong results in a production environment.


I really think it does.

I also appreciate your idea of porting dplyr to python, keep up the good work :)

This table sums up some of it:

operation | time

apply score + 1 | 30s

apply score.values + 1 | 3s

transform score + 1 | 30s

transform score.values + 1 | 20s

It seems to me that pandas is simply a leakier abstraction than dplyr, data.table etc. As a user of the library in most instances you shouldn't have to profile your code to figure out why things behave the way they do (btw, thanks for pointing out snakeviz - it seems like a useful tool).

This being said, we shouldn't complain too much about pandas - it is in the end a very important and useful tool.


I mean, perhaps comparing R to panda is too low a bar.

If I read "Having many different ways to express the same logic makes it hard for developers to understand programs of heterogeneous styles" in a vacuum, the very first thing I would think of is R.


Indeed, but I was more specifically referring to smth like data.table, which I found simpler to use than Pandas.


Could you give some examples of this issue with pandas?


Some of the examples provided in the paper are eloquent:

•df[df.a>3]

•df[df["a"]>3]

•df.loc[df.a>3]

•df.loc[df["a"]>3]


Not sure that I'd consider those all that eloquent since it's just the product of 2 different pieces of syntactic sugar (df.a being shorthand for df["a"] and df[<index filter>] shorthand for df.loc[<index filter>]).


Here's another example: https://stackoverflow.com/questions/49936557/pandas-datafram...

What's the difference between query() and loc()? Do they evaluate to the same thing under the hood? Is one better than the other? In what cases?

These are questions that don't have obvious answers at first sight.


Well, that's kind of the point. What's the purpose of the syntactic sugar? Is it just that, or is there some hidden performance difference? This is not clear at first sight.


The point is to "Huffman encode" the API for expressing near-boilerplate. Like unix command names and flags.

The problem is that there is no simple logically coherent API to use when you haven't memorized all the shortcuts. And the author only allows "tax form" APIs (what he calls "Pythobic/Pandonic" where every parameter is a single atomic step, so it's laborious to express things like tree-structured queries that are more complex than parameter dictionaries.


Agree, data.table is the perfect blend of table and dataframe syntaxis for scientists. It's also performant.


Tables and data frames are both leaves in the far more fundamental flow that beginners don't pick up because it is too simple - the relational model of data. The real problem is the the basic normal forms are so obvious and simple it is difficult to tell if people designed around them on purpose or stumbled onto the right path.

I suspect the distinction between tables and data frames (and arguing about query languages) is from people who have lost sight of the important things when dealing with data - the relational model and a system that supports relational algebra. Beyond that, features and optimisations need to be justified in terms of today's (and tomorrow's) needs which are specific to the data. The only major issue with SQL is a lack of support for something like R's tidyverse gather() and spread() operations which could be considered as missing operations in relational algebra.


I think one limitation of what you're saying is that you need to process the data before they become relational model. You need to integrate and normalize, and therefore, at some point, you need to support datasets that are not in normal form. Or perhaps you don't even need to normalize.

But I also agree with you. I think we are teaching programming wrong, we start with imperative programming, but perhaps we should start with data modelling. How the data look like? What are the constraints? Can we test the constraints? How many bits I need to store it?

"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious." -- Fred Brooks


But that was exactly how programming was taught 30 years ago. Relational data modeling was your starting point. Then object orientation came along and we suffered the dreaded object relational impedance mismatch when we couldn’t translate invoice header to invoice line-items in an object oriented way. Yet somehow the majority of development continued down this design philosophy, abstracting the data model further and further away (it was only heresy for a short while that data models were being generated by tools of process models). Then NoSql. Then graphQL dragging the lamppost of software development further away from understanding data first.

The central design premise for relational data modeling was “if your model could potentially allow inconsistencies, assume the inconsistencies”. Today that premise is easily brushed aside with “you’ll never get anyone manipulating this table without going through this layer” or “we then group by these columns to show unique records to the user”.


I think NoSQL is only as popular as it is because a lot engineers don’t have a proper understanding of the relational model (which isn’t hard to understand at all, it just seems to have fallen out of popularity). I certainly think there are completely valid use cases for denormalized datastores, but I don’t think those use cases are what’s driving their adoption. You can see this in how so many mongo/dynamo... apps end up just being semi-normalized, relational-ish databases.

I also think another reason is that relational database interfaces don’t really fit in with the architecture people want to use these days. Products like dynamo have secure and operable HTTP interfaces and SDKs that fit in really nicely with the ‘serverless’ stuff. To run a relational database you pretty much need to run a network, which isn’t particularly compatible with such architectures.


I think a factor in this is that NoSql databases have nice API's that programmers can use to setup tables, do simple queries etc, which makes it much easier to get started. For RDBMS's you have to muck around with connections and SQL, which is more powerful but requires much more ceremony. (Connection pooling, prepared statements etc)

The lack of understanding of the relational model is not the limiting factor in my experience, the developer experience is just much worse.


Perhaps this is one reason why frameworks like Django are so popular


> but requires much more ceremony. (Connection pooling, prepared statements etc)

NoSQL databases need you to go through this ceremony too. It may be less obvious if you're just passing JSONs over HTTP all the way, but something still needs to keep connections and sanitize untrusted input.


What you say is True, but can be solved by an ORM, though that adds an extra layer of complexity


It's not just ORM. Many NoSQL databases allow for real-time events (query subscriptions), a simple security model, built-in data versioning, built-in sharding. You rarely get any of that out-of-the-box for a traditional RDS.


NoSQL is popular because it's untyped/dynamically-typed. You don't need to do a schema migration every time you insert a new data type.


Judging by my previous managers way of describing things ("give me if x then y"), many people understand the relational model on a basic level, but can't think in sets when describing the output they are looking for ("give me x where Y"). While you can get "if" statemnts in SQL, it's not the way you should be thinkng when doing any sort of non-trivial query.


The dreaded generic object table* is something I think people have put in SQL databases forever and will continue to do so, so it seems superficially logical to me to say "why not just use an object store of some sort?"

*I have experience with a system, not designed by me, that had one, and we were always going to redo/split it but never did.


"Serverless" is just "someone else's servers". I might be missing something big here, but I don't see why - in principle - a "serverless", SQL-over-API-as-a-Service couldn't work.


It’s also an abstraction on top of those servers, that separates your business logic from the underlying architecture to a degree that a lot of people find really appealing.

> but I don't see why - in principle - a "serverless", SQL-over-API-as-a-Service couldn't work.

It can, but there’s just not really any good ones. Where as there are products like Dynamo which are amazing from an operability standpoint (as long as your use case doesn’t run up too hard against any of its constraints). AWS Serverless RDS is pretty terrible for example, the engine choices are limited, the scaling story is terrible, it’s expensive, it doesn’t actually have anything close to the “on-demand” functionality described in the marketing material, and the interface is just a mechanism for passing SQL queries around (so you’d probably want to use yet another abstraction layer for constructing your queries). Spanner is pretty good, but it’s really expensive, and isn’t amazing enough on its own to justify moving to GCP unless you’re already there. You can also run an HTTP interface yourself for you RDBMS, but there isn’t really a mature product in that space. There’s no show stopping technical limitations there, it’s just not a well or widely supported feature.


> "Serverless" is just "someone else's servers".

Clearing that bit up:

Cloud is "just someone else's servers".

Serverless is "someone else maintains the infrastructure." It's one step further in the same direction.


Correct.

If the cloud is akin to renting a flat instead of owning it, then serverless is like living in hotels.


I've never understood the myth of the "object relational impedance mismatch" (ORIM).

Objects are just a collection of attributes. Table rows are just a collection of attributes. There are some things that objects can do that the relational model can't but if you are designing a database schema why would you care about those things? There is no rule that you have to use every feature of a programming language everywhere. So just use the least common denominator of features and you're good to go.

Complaining about ORIM is like complaining about the universal serial impedance mismatch. You can't plug in USB 3 (OOP) in USB 1.1 (relational) but you can plug in USB 1.1 into USB 3.0 and since we know the old model is still good enough for a lot of use cases (think keyboard and mouse) we still use it even though we have a completely different standard installed into our computers.

What this means in practice is that your Domain/Entity classes look exactly like your database schema, not the other way. You still have to write queries but your ORM makes it ten times easier by offering very convenient query builders that let you build dynamic queries without string concatenation.


This misunderstands the relational model -- it doesn't mean "my app's data model". Ie., it's not that there's fields and rows.

It's that the data is represented with sets, and there's an algebra over those that provides (strong) guarantees and principled way of composing operations.

Yes, you use objects as mere key-value pairs and provide a bizarre semantics for the relational algebra over sets of these objects -- but! -- this isn't object orientation.

Eg., in OO objects compose -- in the relational algebra rows dont "compose", eg., even having person.address.street breaks the semantics of 'SELECT'

The interpretation of p.a.s has to be as a subset on a product of relations (ie., tables P and A filtered on a join of p.id to a.id, etc...)

This is one of the key impedance mismatches in OO<->Relations -- composition =/= join. Hence awkward and and ugly workarounds in all ORMs.


I always felt like the relational model, SQL, and the extensions and add-ons like PL/SQL come from people with totally different worldviews and make quite a mess together. I couldn't design something better than PL/SQL, but it just seems so, so wrong to me.

On the other hand, the purists* who rant about nulls, I think have missed something as well.

But on the whole, I would like to see something related to SQL, that tries to improve it without grafting things on, that has an overall vision and an approach of simplifying and making it more coherent. Not being committed to the syntax, and wanting to close the loop more between manipulating data and manipulating the language itself.

I don't know, if you say SQL is a functional programming language, ok, and you take some common functional language that has nice syntax and make it consistent with the relational model, what do you get?

*Of which my impression was formed by (IIRC): "Relational Database Writings, 1989-1991 by C. J. Date" (which Amazon seems to want over $1,000 for right now, in paperback)


i am confused. R's "gather" just merges a set of columns values into a single column. how is that beyond the scope of sql?

It is worse than a table.

Dataframe don't really have a clear boundary as what it can do or cant do. I see no problem why you can't shoehorn some control flow logic into it which makes it pretty much just an executor of arbitrary computation graph specified in DSL, which relies on however the developer decides to implement it.

I'd rather take SQL because I have a better understanding what it is doing.


Can you drop nans, parse strings to floats, apply arbitrary lambdas over a rolling window, take the cosine of those values, and plot a graph against time, in a single line of sane SQL?

Easy in Pandas.

Dataframes are not tables; tables are not dataframes. It's nearly as apples-to-bananas as comparing Python lists to C arrays.


everything you mentioned can be done easily through database schema. window functions work well in SQL. plots are easily done in any BI solution that hooks up to any database.

pandas is just poor man's SQL+BI. pandas stores everything in memory and has many limitations.

in SQL Server I can easily churn through terabyte sized database and get the data I need, because the schema is well designed with partitioned tables, clustered indexes and a well designed SQL takes less than a second to run against >1TB database. It even allows a team of more than 20 people to work with the same SQL and query it simultaneously.

i would love to see how you can analyze 1TB csv file, or a pile of 100 csv files totalling over 10TB where you have yet to discover the schema and how tables join together. and I am doing it with on a simple $700 workstation, not even using any hadoop nonsense


How about working with poorly designed schemas? I work with SQL-Server as well, dealing with legacy data designed around imperative t-sql programming. Our 'BI-Solution', SSRS, crawls on pretty simple queries, where 'hacks' need to be done, joining on same table, all kinds of dirty tricks...

I don't know... I honestly feel like 'BI-Solutions' are a poor-persons Python if you are doing anything more than simple dashboards. Something that can be done in 2 lines of code in a Notebook requires endless fiddling in an IDE, to produce something not easily reproducible.

Aside, I've no experience with Tableau or Power-BI, just know that Crystal Reports and SSRS which are pretty painful.


it's hard dealing with legacy stuff. One alternative I can propose - pitch your management and go get yourself a separate and latest SQL instance just for analytics. Easiest solution you can do is to install SQL Server Developer version which is free.

cherry pick what your need and ETL your data out of legacy systems into your warehouse and run something like tableau/looker/powerbi on top and you will be amazed how effective you can be


Agreed, but again, I ETL my data into a warehouse, as a developer (not a BI person), I'm reaching for spark, flink, or whatever to roll my analytics, and python/flask/d3 for building web dashboards.

Then, once you have 'insight' into your Data, you can easily 'do' something with it without the limitations of a tiered product.


is the database server running on a $700 workstation? how many rows? what types of queries? what is a typical query execution time? interested in your response because I generally find RDBMS performance quite poor, although I've never used SQL server. Pandas gets to be fairly painful after the data size hits 10GB, in my experience. I do think you are missing how pandas fits into a data exploration pipeline for someone who knows how to code - there's plenty of tasks for which SQL+BI is not going to cut it.


Commodity virtualized server with 4 cores and 8GB of RAM, storage is on NAS. We have hundreds of these typical SQL instances.

db has lot of rows, around 20k rows logged per minute and the events are logged 24/7 three years.

Again, because the schema is well designed, I use clustered index on Date to filter and analyze and the engine actually never reads the whole db all the time. It actually read only the pages I need, and that's the benefit from millions of man-hours MSFT invested in optimizing its SQL engine.

typical response time depends on date period, I try to write queries that dont take more than a 5-10 secs to run.

if you have RDBMS performance problems - just hire an expensive DBA for a brief period and learn-learn-learn from her how to work with large data effectively. DBAs can optimize your db to run as fast as a your hardware I/O speed.


I just did a simple benchmark: 67 million rows, integers, 4 columns wide, with postgresql 10 and pandas.

pg 10

  huge=# \timing on
  Timing is on.
  huge=# copy lotsarows from '~/src/lotsarows/data.csv' with csv header;
  COPY 67108864
  Time: 85858.899 ms (01:25.859)
  huge=# select count(*) from lotsarows;
    count                               
  ----------                            
   67108864                             
  (1 row)                               
                                        
  Time: 132784.743 ms (02:12.785)       
  huge=# vacuum analyze lotsarows;      
  VACUUM                                
  Time: 185040.485 ms (03:05.040)       
  huge=# select count(*) from lotsarows;
    count                               
  ----------                            
   67108864                             
  (1 row)                               
                                        
  Time: 48622.062 ms (00:48.622)        
  huge=# select count(*) from lotsarows where a > b and c < d;
    count                       
  ----------                    
   16783490                     
  (1 row)                       
                                
  Time: 48569.866 ms (00:48.570)

pandas

  In [2]: import pandas as pd                                           
                                                                        
  In [3]: %time df = pd.read_csv('data.csv')                            
  CPU times: user 34.1 s, sys: 4.49 s, total: 38.6 s                    
  Wall time: 38.7 s                                                     
                                                                        
  In [4]: %time len(df)                                                 
  CPU times: user 125 µs, sys: 19 µs, total: 144 µs                     
  Wall time: 166 µs                                                     
  Out[4]: 67108864                                                      
                                                                        
  In [5]: %time ((df['a'] > df['b']) & (df['c'] < df['d'])).sum()       
  CPU times: user 1.74 s, sys: 135 ms, total: 1.88 s                    
  Wall time: 1.88 s                                                     
  Out[5]: 16783490


ok, so 20k * 60 minutes * 24 hours * 365 days * 3 years = 31,536,000,000 rows. You are querying 31.5 billion rows on a machine with 4 cores and 8gb ram? Are queries that return in 5-10 seconds running over the entire table? or small portions of it?


small portions of it, sometimes an hour or two, sometimes a day, or a week. most of the times there are 5-6 other conditions. pandas will have to full scan entire dataframe for any query to filter, while SQL uses index seek


Why aren't you or some other sql whiz bringing home all the kaggle prize money being won by people using inefficient dataframes?


i think you are confusing xgboost that uses its own DMatrix that is winning kaggle, not dataframe


Of course you shouldn't use Pandas to analyze terabytes of data, but most people aren't analyzing terabytes of data.


That's what Spark is for. You can do petabyte-scale jobs... with DataFrames.


Technically all SQL compiles into a single line.

Less pedantically you can quite easily and elegantly drop nans, str->float, perform just about any operation over a rolling window and perform trig functions on it. Generally in a much more sane way than pandas.

In my life as an analyst and data scientist I've found SQL to be far, far superior to pandas for something like 90-95% of tasks.


> Can you drop nans, parse strings to floats, apply arbitrary lambdas over a rolling window, take the cosine of those values, and plot a graph against time, in a single line of sane SQL?

Yes you can, Postgres support Python UDF btw. Though I don't think that is necessary a daily common feature to use.

SQL is pretty powerful if you look deep


> What should we make of it?

People are idiots! (I am sorry to say that, I don't really mean it, I empathize, everyone sometimes is.)

Yes, dataframe is pretty much just a table. (And yes, GraphQL is a poor reinvention of SQL.) However, to be fair, there are different considerations. Database needs to know things like storage constraints and foreign keys (so you have many different column types), when you're doing just analytics (i.e. pandas), you pretty much only need two types - number and (trimmed) string (and sometimes a datetime, but that's just conveniently formatted number). (I think SAS got that right.)

Anyway, I think the way out of this mess is to have a functional, Haskell-like language for processing tables (not necessarily Turing complete) that would subsume declarativness of SQL, and that could be compiled into different targets. The language would basically specify some (limited, so recursion in that language would not be allowed) processing of tables (of known types), and you could then compose these processing pieces, or integrate them with custom processing in other language.

I understand why people hate SQL, it is on some level hard to write and compose. I think a correctly designed functional language would help here. Or maybe just tear LINQ out of the .NET ecosystem.


In the re-inventing SQL department, I'd take a look at EdgeQL/EdgeDB. It's not perfect, but much closer to a functional language and composes well.

I would not consider GraphQL a poor reinvention of SQL, since its niche of decoupling and simplifying untrusted high latency clients is too different for flexible queries created by a trusted server. It competes with REST and RPC, not SQL.

GraphQL's native operations are limited to following a foreign-key link, which has predictable performance (quasi linear in the number of objects in the response) and selecting a subset of fields (reduce the response size and enable field level deprecation and usage tracking). These limitations prevent both malicious clients and less performance concerned front-end developers from putting excessive load on the database. These limitations also allow it to work with both data stored in a database and data generated by an application, while supporting SQL is pretty much limited to being processed by a database.


The way I see it, SQL and GraphQL are solving somewhat complementary problems. In SQL, I have a structure (all these tables that possibly have to be joined) in the database and I want to pick something out as a simple result table. In GraphQL, I create the more complex structure on the output.

But I do consider GraphQL somewhat unnecessary, because if those REST APIs composed just like tables do in the database, then you wouldn't need GraphQL, and you could run a normal query. (There is also a problem of externalities, putting the processing costs on the client is cheaper.)

And thanks for pointing out EdgeDB.


>you wouldn't need GraphQL, and you could run a normal query.

You're making the assumption that there is something to run a normal query on. As soon as you write even a single line of server side code this assumption is broken. What if the GraphQL query doesn't actually use an SQL database and just reads something from a file or another service? What if the server is responsible for granting row level access under very complicated rules that cannot be implemented with just SQL and would be completely insecure if it was done on the client? What if you actually need to do things like implement business logic?

What you're talking about is downright nonsensical within that context.


These counterpoints are probably valid for most database systems, but with Postgres it's actually far more efficient to use it as the substrate in which everything else is embedded.

* Postgres has a robust, battle tested role based security model with inheritance.

* Postgres has foreign data wrappers that let you encapsulate external resources as tables that behave the same way as local tables for most use cases.

* Postgres has plugins for most of the popular programming languages.

If you really like GraphQL, the Postgres approach can still give you that too, using Hasura or PostGraphile.


That's a distinction without a difference; you can execute SQL against CSV with the right tool, nothing you need to write yourself, likewise with programmatic data sources.

Implementing significant business logic on your data access path has other ramifications, but they're orthogonal to the query syntax chosen. And I say that as someone who thinks GraphQL is a fine idea, especially for reducing churn in API specs when you have separate teams for front and back end dev.


I'm not trying to move the goalposts -- you've mentioned SQL specifically, and not relational databases in general -- but RDBM systems offer these features, and in a way that's compatible with access via SQL. Foreign data wrappers can present external data sources as if they were SQL views, allowing cross-source relations to be constructed and queried. Complex row level access and business can be implemented in stored procedures and triggers, if a simpler method can't be found. People will debate the extent to which these features ought to be used, but they certainly exist.

My point is that it isn't "downright nonsensical" at all.


You're talking a lot about implementation, when you compare 2 languages. In principle - it's worth inventing a new language when you can express some common patterns clearer/ more naturally. Otherwise - just make a query engine that's very performant for a subset of SQL (SELECT + following foreign-key link), and outright reject all other kinds of SQL.


Everything is RMDB

I advocate: Building a relational data model on top of hash-map to achieve a combination of NoSQL and RMDB advantages. This is actually a reverse implementation of PostgreSQL.

[Clojure is a functional programming language based on relational database theory](https://github.com/linpengcheng/PurefunctionPipelineDataflow...)

[Everything is RMDB](https://github.com/linpengcheng/PurefunctionPipelineDataflow...)

[Implement relational data model and programming based on hash-map (NoSQL)](https://github.com/linpengcheng/PurefunctionPipelineDataflow...)


I agree, but prefer a tree map in combination with natural keys where possible to get ranges & sorting without complicating the design. Then I embed additional tree maps in records to model has-many-relations. And I know SQL very well, but this setup often provides the capabilities I need and is much more convenient to deal with.


Can you elaborate on the topic of a "relational data model on top of a hash map"?

Are there any books that cover the concepts?


"a relational data model on top top of hash-map" is my original idea, and then combine [The Pure Function Pipeline Data Flow v3.0 with Warehouse/Workshop Model](https://github.com/linpengcheng/PurefunctionPipelineDataflow), can perfectly realize the simplicity and unity combination of system architecture.


cockroach is built on top of a key/value index, i believe.

https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mappin...


not the parent (but I like their thinking); I think the key concept is to treat the hash map as just a (potentially primary) index over your data. Of course keeping secondary indices up to date is now the job of the application (or the abstraction layer)


But isnt that just a sql db?


But it is also Nosql (hash-map). It is flexible according to your needs, as sqldb or Nosql or data structure.


the idea is, if for whatever reason you do not have access to a proper DB, can you build your own relational model on top of the bits you have?


OK, so, disclaimer, I haven't ever used GraphQL with live ammunition, but I really don't get the impression that it can be dismissed so easily.

TFA does a good job of showing that, while SQL is a cleaner design and theoretically better than the hot mess that is Pandas dataframes in every way, the dataframes API offers a lot of conveniences and ergonomic affordances that can make it more pleasant to program against for certain tasks. As someone who spends a heck of a lot of time using Pandas to prep data for loading into an RDBMS, I agree 1,000%. As someone who, once upon a time, did assembly language programming on both RISC and CISC architectures, and, once upon another time, bounced back and forth between Scheme and Objective-C, this sort of situation doesn't surprise me at all. It's just the standard "worse is better" story that we all know and love to hate.

I suspect it's similar for GraphQL. For example, the JSON basis for the language does make it awful to write by hand, but, if you're talking APIs here, you may not be doing too much much hand-authoring of queries. And it's going to be a lot easier than SQL to manipulate programmatically, since, particularly in a language like JavaScript or Python, its syntactic structure maps trivially to basic data structures that already exist in the language.


>> if you're talking APIs here, you may not be doing too much much hand-authoring of queries

No experience with GraphQL myself, but this is a very good point. A lot of the practical problems that come with SQL queries boil down to the code that constructs a string to be used as the query, so it can then be deconstructed by different code to execute the query. There's a lot of mistakes that happen right there.


And this is the pain I'm thinking of. Doing dynamically generated SQL that isn't susceptible to SQL injection can get tricky, and ORM frameworks generally only partially solve the problem for a certain subset of possible schemata.

I do think SQL is a well-designed language. But it was designed, first and foremost, as a language for human analysts to bang into a terminal. Having computers flexibly communicate using it is an off-label use.


Have you had a look at Datalog? It's a carefully selected subset of prolog that corresponds to primitive recursive functions.

So: plenty of computational expressivity, but solidly removed from Turing completeness.

Adding types for relational algebra to Haskell is a bit of a slog. Not because it's impossible, but just because the natural way to express those data types is not what comes natural in Haskell.


I think you are describing something close to the Dataset API in Spark. Spark is built on the RDD, a novel data structure that creates transparent concurrency and distribution. Additionally, you can access the same data with 4 APIs, one of which is SQL, and another which is a typed functional API.

The RDD paper is one of my favorite papers, and is great bed time reading.

https://www.usenix.org/system/files/conference/nsdi12/nsdi12...


As a big fan of the relational model (but not so much of SQL) I just wish that pandas didn't tread indices specially but just as a normal columns. Also, multiindices should just be indices on multiple columns. And I should be able to have as many indices as I want.


> Anyway, I think the way out of this mess is to have a functional, Haskell-like language for processing tables

I agree, and in a Haskell-like setting we can have any structured types (rather than just primitives).

That’s why I built hobbes: https://github.com/Morgan-Stanley/hobbes


> And yes, GraphQL is a poor reinvention of SQL.

No. GraphQL is an RPC specification. There is not a single comparison operator defined it the GraphQL spec. There is no way to join two separate collections. GraphQL was never meant to be an alternative to SQL, and people mainly try to compare them because they both end in "QL".


The main similarity is the whole idea of declaratively saying what you want in a single request. In SQL, you use joins or subqueries, and in GraphQL you use use nested edge/node blocks.

Either way, you define what you want in a nested/tree-like manner and submit one big-ass request to the server.

The difference is that GraphQL is usually way less verbose and tedious to type out, but they’re fundamentally the same idea.


I don't think it needs so much a full blown language - its just an api that allows you to feed data in and pull it out for any arbitrary variety of data (including types/classes and links between data), but then has special commands to tell it how to optimise data storage (for speeding up a particular kind of access need, for minimising data storage, etc.)


Sounds like datalog could fit this bill


Summary: no, because row order matters in dataframes. That's why matrix operations are a better computational framework for dataframes than relational algebra.


That's doesn't seem a very defining characteristic. Row order/index is just an implicit primary key column.


The point is which operations you optimize for. Most SQL databases aren't well-optimized for matrix multiplication, but are great with search-style queries, and the opposite for dataframes.


It matters in many non-trivial queries as well.


The article is a nice comparison of pros and cons of Pandas and SQL, but the title and ensuing comments are misleading and off the point, akin to favorite arguments like "is Haskell types just C++ objects"? They have similar and differences, in core semantics and in ergonomics.

Pandas has a perlish API designed for abbreviating common tasks, combined with a Pythonic (per common practice, despite being against Putin's stated principles) disregard for informative use of types, instead a simple coherent, principled, layered API. That makes it look like "SQL done wrong" because the semantic differences aren't laid out explicitly.


> being against Putin's stated principles

walk carefully


A dataframe may be a table, but it can also be a matrix by supporting some matrix operations.


yes, the problem is, it is not a very good table.


Reading the sections 2.2.3 (about joins) and 3.5 (about code comprehension) took me aback! I came with a positive a priori for dataframes, but changed my mind. SQL is far to be perfect, but at least there are sound foundations!


I liked the article, although it kinda conflates SQL and database and Pandas and dataframe.

It does acknowledge other dataframe implementation (Base R and Spark) but most of the arguments about what defines a dataframe are taken from the Pandas API.


Man, I wish there were more honest comparisons of approaches like this one rather than the "shut up, stupid" stackoverflow example he references at the start.



I was unable to learn R because I couldn't understand what a dataframe is. It was irritating that it wasn't defined clearly and there seemed to be no connection to terminology that was familiar to me (relational databases, SQL tables etc.).


There's no other way to say this without sounding rude, but you weren't unable to learn R because you couldn't understand what a dataframe was - you were unable to learn R because you gave up. Blaming a data structure for the failure seems like a bit of a stretch.


That is what I said. I didn't understand what a dataframe was. Note that I didn't really blame the data structure. I said it was an irritating factor. Someone else might not have been as irritated. But if R was closer to what I already knew, learning would have been easier. For me.


In R a dataframe is just a named list of vectors, where each vector has equal length.


Article aside - that LaTeX theme is dope


Yes


This doesn't seem to be from 2016 (some identifiers suggest that it is, but it cites papers up to 2018 and it says that a tweet from 2016 is two years old).


Indeed. "Conference on Very Important Topics 2016" is not a real conference, but placeholder from a template. Maybe it was left behind by accident? The paper is from the PLATEAU Workshop 2019.


Ok, we've added three years to the title above. Thanks to both of you!


Oh I thought "Conference on Very Important Topics" was Twitter


“Now GraphQL has almost 15K GitHub stars and a large and active developer community. What should we make of it?“

don’t know why, made me smile. Didnt know the DB world is so edgy - its cute.

I say bygons about graphql, usability always trumps fad in the end and “nature will decide”, nature the masses and hordes of “US over Time”. Does GraphQL do warehousing well ?

Well im not militant and think its cool.

One could say, based on my experience at uni with random access machine calcs by hand, that everything a computer does reduces to a table. Anyone with examples of things that arnt essentially tables ?


> Anyone with examples of things that arnt essentially tables ?

Well, my definition of table: Fixed number of columns of possibly heterogenous types. Variable number of rows that have all the same type.

So, for starters, a matrix is not a table. Nor is a list, a tree, or a hash table (KV store).


> Variable number of rows that have all the same type.

Why do they need to have the same type? In sqlite a field can have a different type in every record/row. (https://www.sqlite.org/datatype3.html). Is having a fixed typed fields fundamental to the concept of a table, or just a property of most SQL implementations?


How do you aggregate without field types? You'd end up writing custom code for untyped map-reduce...

How can you code know what data to expect without typed field? Your code size can grow up even 10x if you need to assume that any record can have any shape...

You could have more complex and user defined types in an ideal super-SQL, like "int or map:string->bool" etc., but you WANT types. They reduce complexity at all levels! You want them in you code, you want them in your data!

Sure, go ahead, just use MongoDB with no constraints on collections structure... You'll end up with 5x more code in you application and 5x more bugs.

SQL and the relational model help you think more, code less (and slower - but this is an advantage) and have less bugs with less effort! If you're lazy and prefer thinking deeper to "working harder", you'll always prefer properly typed SQL :P (Unless you're working with stuff that's naturally shaped as graphs or tensors.)


None of your remarks come close to answering my question (is having a type definition in the field/schema fundamental to a table, or just a way that most SQL databases decided to implement them)?

Regarding your questions, in case of sqlite3, a _value_ is typed, but a _field_ is not. sqlite3 only has a very small number of types (numbers, text and nulls) and functions (both aggregation and non-aggregation) that expect either number or text do standard SQL casts before evaluation.

In practice the type systems of all common SQL implementations are so weak that their "guarantees" are worth very little for the user/programmer anyway. (I have a hunch that they're really mostly for internal optimizations). All your code need to handle nulls anyway (and if you only use non-nullable fields, users are just going to use their own placeholders/adhoc sentinal values).

Back to the discussion: sqlite3 don't impose any type homogeneity on fields/columns, while Pandas / R dataframes do. So why should that be a characteristic of a "table", let alone one that distinguishes a "table" from a "dataframe"?


Who's to say who gets to write the formal definition? And just because the formal definition claims or doesn't claim something, why does that mean every single implementation has to behave exactly that way?


I think u both complete each other as a full argument with a pro and contra arrangement. Lovely af to read you both honestly.


Interestingly, the tables i know from random access machines are basically of type

bin -> bin,

So what were saying here that random data is stuff we cant meaningfully into a table?

I think thats a good statement.

Tables are non random data. (Crypto keys are again data, because they are bin -> bin in terms encrypt and decrypt, but the number itself isnt a table)


Oh interesting, because i wanted to say “everything is a tensor” before i write the long post

So generics and mixed types is where it hits a wall with tables? Though those mixed structured can just be broken down to collections of tables and enumerated, no? Then call the table of tables a table again?

Also why is a matrix not a table? Because its a super type ?


A matrix is not a table because all the columns need to be the same type. While you can convert a matrix to a table, the converse is not true.

(I assume this is what he meant, as it is the case in R.)


Thanks to GraphQL, the DB world is edgy and nodey.


This guy's writing is really inconsistent. "v." and "vs.", plus singular/plural disagreements all over the place. I think the CS/ML/EE fields need to raise their standards w.r.t. editing. Interesting paper though.




Applications are open for YC Summer 2020

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: