It's a bit tangential to the main point, but I do agree with this remark.
I have always found Pandas uncomfortable to work with. I'm never sure if I'm doing things in the most efficient/idiomatic way and I've found it hard to be consistent over time, especially since I've picked up different bits of code from different places.
I've gotten a lot more efficiency out of R, especially the data.table package.
For example, a grouped filter is very cumbersome in pandas.
Interested to hear if you think it gets at the heart of the problem.
I think that is a great point to add to you very interesting article. I wouldn't know which of the two operations is correct to use, and I would not notice anything wrong, or odd with either method in a code review, so this is ripe for adding wrong results in a production environment.
I also appreciate your idea of porting dplyr to python, keep up the good work :)
This table sums up some of it:
operation | time
apply score + 1 | 30s
apply score.values + 1 | 3s
transform score + 1 | 30s
transform score.values + 1 | 20s
It seems to me that pandas is simply a leakier abstraction than dplyr, data.table etc. As a user of the library in most instances you shouldn't have to profile your code to figure out why things behave the way they do (btw, thanks for pointing out snakeviz - it seems like a useful tool).
This being said, we shouldn't complain too much about pandas - it is in the end a very important and useful tool.
If I read "Having many different ways to express the same logic makes it hard for developers to understand programs of heterogeneous styles" in a vacuum, the very first thing I would think of is R.
What's the difference between query() and loc()? Do they evaluate to the same thing under the hood? Is one better than the other? In what cases?
These are questions that don't have obvious answers at first sight.
The problem is that there is no simple logically coherent API to use when you haven't memorized all the shortcuts. And the author only allows "tax form" APIs (what he calls "Pythobic/Pandonic" where every parameter is a single atomic step, so it's laborious to express things like tree-structured queries that are more complex than parameter dictionaries.
I suspect the distinction between tables and data frames (and arguing about query languages) is from people who have lost sight of the important things when dealing with data - the relational model and a system that supports relational algebra. Beyond that, features and optimisations need to be justified in terms of today's (and tomorrow's) needs which are specific to the data. The only major issue with SQL is a lack of support for something like R's tidyverse gather() and spread() operations which could be considered as missing operations in relational algebra.
But I also agree with you. I think we are teaching programming wrong, we start with imperative programming, but perhaps we should start with data modelling. How the data look like? What are the constraints? Can we test the constraints? How many bits I need to store it?
"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious." -- Fred Brooks
The central design premise for relational data modeling was “if your model could potentially allow inconsistencies, assume the inconsistencies”. Today that premise is easily brushed aside with “you’ll never get anyone manipulating this table without going through this layer” or “we then group by these columns to show unique records to the user”.
I also think another reason is that relational database interfaces don’t really fit in with the architecture people want to use these days. Products like dynamo have secure and operable HTTP interfaces and SDKs that fit in really nicely with the ‘serverless’ stuff. To run a relational database you pretty much need to run a network, which isn’t particularly compatible with such architectures.
The lack of understanding of the relational model is not the limiting factor in my experience, the developer experience is just much worse.
NoSQL databases need you to go through this ceremony too. It may be less obvious if you're just passing JSONs over HTTP all the way, but something still needs to keep connections and sanitize untrusted input.
*I have experience with a system, not designed by me, that had one, and we were always going to redo/split it but never did.
> but I don't see why - in principle - a "serverless", SQL-over-API-as-a-Service couldn't work.
It can, but there’s just not really any good ones. Where as there are products like Dynamo which are amazing from an operability standpoint (as long as your use case doesn’t run up too hard against any of its constraints). AWS Serverless RDS is pretty terrible for example, the engine choices are limited, the scaling story is terrible, it’s expensive, it doesn’t actually have anything close to the “on-demand” functionality described in the marketing material, and the interface is just a mechanism for passing SQL queries around (so you’d probably want to use yet another abstraction layer for constructing your queries). Spanner is pretty good, but it’s really expensive, and isn’t amazing enough on its own to justify moving to GCP unless you’re already there. You can also run an HTTP interface yourself for you RDBMS, but there isn’t really a mature product in that space. There’s no show stopping technical limitations there, it’s just not a well or widely supported feature.
Clearing that bit up:
Cloud is "just someone else's servers".
Serverless is "someone else maintains the infrastructure." It's one step further in the same direction.
If the cloud is akin to renting a flat instead of owning it, then serverless is like living in hotels.
Objects are just a collection of attributes. Table rows are just a collection of attributes. There are some things that objects can do that the relational model can't but if you are designing a database schema why would you care about those things? There is no rule that you have to use every feature of a programming language everywhere. So just use the least common denominator of features and you're good to go.
Complaining about ORIM is like complaining about the universal serial impedance mismatch. You can't plug in USB 3 (OOP) in USB 1.1 (relational) but you can plug in USB 1.1 into USB 3.0 and since we know the old model is still good enough for a lot of use cases (think keyboard and mouse) we still use it even though we have a completely different standard installed into our computers.
What this means in practice is that your Domain/Entity classes look exactly like your database schema, not the other way. You still have to write queries but your ORM makes it ten times easier by offering very convenient query builders that let you build dynamic queries without string concatenation.
It's that the data is represented with sets, and there's an algebra over those that provides (strong) guarantees and principled way of composing operations.
Yes, you use objects as mere key-value pairs and provide a bizarre semantics for the relational algebra over sets of these objects -- but! -- this isn't object orientation.
Eg., in OO objects compose -- in the relational algebra rows dont "compose", eg., even having person.address.street breaks the semantics of 'SELECT'
The interpretation of p.a.s has to be as a subset on a product of relations (ie., tables P and A filtered on a join of p.id to a.id, etc...)
This is one of the key impedance mismatches in OO<->Relations -- composition =/= join. Hence awkward and and ugly workarounds in all ORMs.
On the other hand, the purists* who rant about nulls, I think have missed something as well.
But on the whole, I would like to see something related to SQL, that tries to improve it without grafting things on, that has an overall vision and an approach of simplifying and making it more coherent. Not being committed to the syntax, and wanting to close the loop more between manipulating data and manipulating the language itself.
I don't know, if you say SQL is a functional programming language, ok, and you take some common functional language that has nice syntax and make it consistent with the relational model, what do you get?
*Of which my impression was formed by (IIRC):
"Relational Database Writings, 1989-1991 by C. J. Date"
(which Amazon seems to want over $1,000 for right now, in paperback)
Dataframe don't really have a clear boundary as what it can do or cant do. I see no problem why you can't shoehorn some control flow logic into it which makes it pretty much just an executor of arbitrary computation graph specified in DSL, which relies on however the developer decides to implement it.
I'd rather take SQL because I have a better understanding what it is doing.
Easy in Pandas.
Dataframes are not tables; tables are not dataframes. It's nearly as apples-to-bananas as comparing Python lists to C arrays.
pandas is just poor man's SQL+BI. pandas stores everything in memory and has many limitations.
in SQL Server I can easily churn through terabyte sized database and get the data I need, because the schema is well designed with partitioned tables, clustered indexes and a well designed SQL takes less than a second to run against >1TB database. It even allows a team of more than 20 people to work with the same SQL and query it simultaneously.
i would love to see how you can analyze 1TB csv file, or a pile of 100 csv files totalling over 10TB where you have yet to discover the schema and how tables join together. and I am doing it with on a simple $700 workstation, not even using any hadoop nonsense
I don't know... I honestly feel like 'BI-Solutions' are a poor-persons Python if you are doing anything more than simple dashboards. Something that can be done in 2 lines of code in a Notebook requires endless fiddling in an IDE, to produce something not easily reproducible.
Aside, I've no experience with Tableau or Power-BI, just know that Crystal Reports and SSRS which are pretty painful.
cherry pick what your need and ETL your data out of legacy systems into your warehouse and run something like tableau/looker/powerbi on top and you will be amazed how effective you can be
Then, once you have 'insight' into your Data, you can easily 'do' something with it without the limitations of a tiered product.
db has lot of rows, around 20k rows logged per minute and the events are logged 24/7 three years.
Again, because the schema is well designed, I use clustered index on Date to filter and analyze and the engine actually never reads the whole db all the time. It actually read only the pages I need, and that's the benefit from millions of man-hours MSFT invested in optimizing its SQL engine.
typical response time depends on date period, I try to write queries that dont take more than a 5-10 secs to run.
if you have RDBMS performance problems - just hire an expensive DBA for a brief period and learn-learn-learn from her how to work with large data effectively. DBAs can optimize your db to run as fast as a your hardware I/O speed.
huge=# \timing on
Timing is on.
huge=# copy lotsarows from '~/src/lotsarows/data.csv' with csv header;
Time: 85858.899 ms (01:25.859)
huge=# select count(*) from lotsarows;
Time: 132784.743 ms (02:12.785)
huge=# vacuum analyze lotsarows;
Time: 185040.485 ms (03:05.040)
huge=# select count(*) from lotsarows;
Time: 48622.062 ms (00:48.622)
huge=# select count(*) from lotsarows where a > b and c < d;
Time: 48569.866 ms (00:48.570)
In : import pandas as pd
In : %time df = pd.read_csv('data.csv')
CPU times: user 34.1 s, sys: 4.49 s, total: 38.6 s
Wall time: 38.7 s
In : %time len(df)
CPU times: user 125 µs, sys: 19 µs, total: 144 µs
Wall time: 166 µs
In : %time ((df['a'] > df['b']) & (df['c'] < df['d'])).sum()
CPU times: user 1.74 s, sys: 135 ms, total: 1.88 s
Wall time: 1.88 s
Less pedantically you can quite easily and elegantly drop nans, str->float, perform just about any operation over a rolling window and perform trig functions on it. Generally in a much more sane way than pandas.
In my life as an analyst and data scientist I've found SQL to be far, far superior to pandas for something like 90-95% of tasks.
Yes you can, Postgres support Python UDF btw. Though I don't think that is necessary a daily common feature to use.
SQL is pretty powerful if you look deep
People are idiots! (I am sorry to say that, I don't really mean it, I empathize, everyone sometimes is.)
Yes, dataframe is pretty much just a table. (And yes, GraphQL is a poor reinvention of SQL.) However, to be fair, there are different considerations. Database needs to know things like storage constraints and foreign keys (so you have many different column types), when you're doing just analytics (i.e. pandas), you pretty much only need two types - number and (trimmed) string (and sometimes a datetime, but that's just conveniently formatted number). (I think SAS got that right.)
Anyway, I think the way out of this mess is to have a functional, Haskell-like language for processing tables (not necessarily Turing complete) that would subsume declarativness of SQL, and that could be compiled into different targets. The language would basically specify some (limited, so recursion in that language would not be allowed) processing of tables (of known types), and you could then compose these processing pieces, or integrate them with custom processing in other language.
I understand why people hate SQL, it is on some level hard to write and compose. I think a correctly designed functional language would help here. Or maybe just tear LINQ out of the .NET ecosystem.
I would not consider GraphQL a poor reinvention of SQL, since its niche of decoupling and simplifying untrusted high latency clients is too different for flexible queries created by a trusted server. It competes with REST and RPC, not SQL.
GraphQL's native operations are limited to following a foreign-key link, which has predictable performance (quasi linear in the number of objects in the response) and selecting a subset of fields (reduce the response size and enable field level deprecation and usage tracking). These limitations prevent both malicious clients and less performance concerned front-end developers from putting excessive load on the database. These limitations also allow it to work with both data stored in a database and data generated by an application, while supporting SQL is pretty much limited to being processed by a database.
But I do consider GraphQL somewhat unnecessary, because if those REST APIs composed just like tables do in the database, then you wouldn't need GraphQL, and you could run a normal query. (There is also a problem of externalities, putting the processing costs on the client is cheaper.)
And thanks for pointing out EdgeDB.
You're making the assumption that there is something to run a normal query on. As soon as you write even a single line of server side code this assumption is broken. What if the GraphQL query doesn't actually use an SQL database and just reads something from a file or another service? What if the server is responsible for granting row level access under very complicated rules that cannot be implemented with just SQL and would be completely insecure if it was done on the client? What if you actually need to do things like implement business logic?
What you're talking about is downright nonsensical within that context.
* Postgres has a robust, battle tested role based security model with inheritance.
* Postgres has foreign data wrappers that let you encapsulate external resources as tables that behave the same way as local tables for most use cases.
* Postgres has plugins for most of the popular programming languages.
If you really like GraphQL, the Postgres approach can still give you that too, using Hasura or PostGraphile.
Implementing significant business logic on your data access path has other ramifications, but they're orthogonal to the query syntax chosen. And I say that as someone who thinks GraphQL is a fine idea, especially for reducing churn in API specs when you have separate teams for front and back end dev.
My point is that it isn't "downright nonsensical" at all.
I advocate: Building a relational data model on top of hash-map to achieve a combination of NoSQL and RMDB advantages. This is actually a reverse implementation of PostgreSQL.
[Clojure is a functional programming language based on relational database theory](https://github.com/linpengcheng/PurefunctionPipelineDataflow...)
[Everything is RMDB](https://github.com/linpengcheng/PurefunctionPipelineDataflow...)
[Implement relational data model and programming based on hash-map (NoSQL)](https://github.com/linpengcheng/PurefunctionPipelineDataflow...)
Are there any books that cover the concepts?
TFA does a good job of showing that, while SQL is a cleaner design and theoretically better than the hot mess that is Pandas dataframes in every way, the dataframes API offers a lot of conveniences and ergonomic affordances that can make it more pleasant to program against for certain tasks. As someone who spends a heck of a lot of time using Pandas to prep data for loading into an RDBMS, I agree 1,000%. As someone who, once upon a time, did assembly language programming on both RISC and CISC architectures, and, once upon another time, bounced back and forth between Scheme and Objective-C, this sort of situation doesn't surprise me at all. It's just the standard "worse is better" story that we all know and love to hate.
No experience with GraphQL myself, but this is a very good point. A lot of the practical problems that come with SQL queries boil down to the code that constructs a string to be used as the query, so it can then be deconstructed by different code to execute the query. There's a lot of mistakes that happen right there.
I do think SQL is a well-designed language. But it was designed, first and foremost, as a language for human analysts to bang into a terminal. Having computers flexibly communicate using it is an off-label use.
So: plenty of computational expressivity, but solidly removed from Turing completeness.
Adding types for relational algebra to Haskell is a bit of a slog. Not because it's impossible, but just because the natural way to express those data types is not what comes natural in Haskell.
The RDD paper is one of my favorite papers, and is great bed time reading.
I agree, and in a Haskell-like setting we can have any structured types (rather than just primitives).
That’s why I built hobbes:
No. GraphQL is an RPC specification. There is not a single comparison operator defined it the GraphQL spec. There is no way to join two separate collections. GraphQL was never meant to be an alternative to SQL, and people mainly try to compare them because they both end in "QL".
Either way, you define what you want in a nested/tree-like manner and submit one big-ass request to the server.
The difference is that GraphQL is usually way less verbose and tedious to type out, but they’re fundamentally the same idea.
Pandas has a perlish API designed for abbreviating common tasks, combined with a Pythonic (per common practice, despite being against Putin's stated principles) disregard for informative use of types, instead a simple coherent, principled, layered API. That makes it look like "SQL done wrong" because the semantic differences aren't laid out explicitly.
It does acknowledge other dataframe implementation (Base R and Spark) but most of the arguments about what defines a dataframe are taken from the Pandas API.
don’t know why, made me smile.
Didnt know the DB world is so edgy - its cute.
I say bygons about graphql, usability always trumps fad in the end and “nature will decide”, nature the masses and hordes of “US over Time”. Does GraphQL do warehousing well ?
Well im not militant and think its cool.
One could say, based on my experience at uni with random access machine calcs by hand, that everything a computer does reduces to a table. Anyone with examples of things that arnt essentially tables ?
Well, my definition of table: Fixed number of columns of possibly heterogenous types. Variable number of rows that have all the same type.
So, for starters, a matrix is not a table. Nor is a list, a tree, or a hash table (KV store).
Why do they need to have the same type? In sqlite a field can have a different type in every record/row. (https://www.sqlite.org/datatype3.html). Is having a fixed typed fields fundamental to the concept of a table, or just a property of most SQL implementations?
How can you code know what data to expect without typed field? Your code size can grow up even 10x if you need to assume that any record can have any shape...
You could have more complex and user defined types in an ideal super-SQL, like "int or map:string->bool" etc., but you WANT types. They reduce complexity at all levels! You want them in you code, you want them in your data!
Sure, go ahead, just use MongoDB with no constraints on collections structure... You'll end up with 5x more code in you application and 5x more bugs.
SQL and the relational model help you think more, code less (and slower - but this is an advantage) and have less bugs with less effort! If you're lazy and prefer thinking deeper to "working harder", you'll always prefer properly typed SQL :P (Unless you're working with stuff that's naturally shaped as graphs or tensors.)
Regarding your questions, in case of sqlite3, a _value_ is typed, but a _field_ is not. sqlite3 only has a very small number of types (numbers, text and nulls) and functions (both aggregation and non-aggregation) that expect either number or text do standard SQL casts before evaluation.
In practice the type systems of all common SQL implementations are so weak that their "guarantees" are worth very little for the user/programmer anyway. (I have a hunch that they're really mostly for internal optimizations). All your code need to handle nulls anyway (and if you only use non-nullable fields, users are just going to use their own placeholders/adhoc sentinal values).
Back to the discussion: sqlite3 don't impose any type homogeneity on fields/columns, while Pandas / R dataframes do. So why should that be a characteristic of a "table", let alone one that distinguishes a "table" from a "dataframe"?
bin -> bin,
So what were saying here that random data is stuff we cant meaningfully into a table?
I think thats a good statement.
Tables are non random data. (Crypto keys are again data, because they are bin -> bin in terms encrypt and decrypt, but the number itself isnt a table)
So generics and mixed types is where it hits a wall with tables? Though those mixed structured can just be broken down to collections of tables and enumerated, no? Then call the table of tables a table again?
Also why is a matrix not a table? Because its a super type ?
(I assume this is what he meant, as it is the case in R.)