Hacker News

Joins for pd.DataFrame.merge() and SQL are explained in exactly the same way; it doesn't matter which order you learn them in. I learned them first in pandas and found the syntax to be more intuitive there.





I'm quite frankly stunned to be reading this. In my experience in industry it's almost universal that pandas syntax is overall pretty horrendous.

SQL is light years more intuitive in my eyes.


  df_merged = df1.merge(df2, on="common_key")
or

  SELECT * INTO df_merged
  FROM (SELECT *
    FROM df1
    JOIN df2
    ON df1.common_key = df2.common_key)
Python and its mature libraries are vastly more concise than most alternatives, SQL included. Especially when you get to beginner Pandas vs. beginner SQL. The latter is an absolute horror show.

Absolutely disagree. Pandas is the horror show.

create table df_merged as (select * from df1 join df2 using (common_key))

Perhaps a bit more verbose but infinitely more readable and parseable.

How about this one: add a between join, i.e. "AND df1.datecol BETWEEN df2.datecol - interval '30 day' AND df2.datecol + interval '30 day'", in pandas?

This is an extremely common operation in analytics, by no means an esoteric use case.
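To make the comparison concrete, here's a hedged sketch of one common way to fake a between join in pandas: equi-join on the key first, then filter to the date window. The frames and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frames: a shared key plus a date column each.
df1 = pd.DataFrame({
    "common_key": [1, 1, 2],
    "datecol": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-01-10"]),
})
df2 = pd.DataFrame({
    "common_key": [1, 2],
    "datecol": pd.to_datetime(["2024-01-20", "2024-01-15"]),
})

# Pandas has no direct BETWEEN join: merge on the key,
# then keep only rows inside the +/- 30 day window.
merged = df1.merge(df2, on="common_key", suffixes=("_1", "_2"))
window = pd.Timedelta(days=30)
result = merged[
    (merged["datecol_1"] >= merged["datecol_2"] - window)
    & (merged["datecol_1"] <= merged["datecol_2"] + window)
]
```

Note the intermediate cartesian blow-up per key before filtering, which is exactly the kind of awkwardness the SQL version avoids.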


You don’t need the “INTO df_merged” or entire second line, right?

This comparison between pandas and SQL (from the pandas site) is a good reference:

https://pandas.pydata.org/pandas-docs/stable/getting_started...

Here's one example:

  -- tips by parties of at least 5 diners OR bill total was more than $45
  SELECT * FROM tips WHERE size >= 5 OR total_bill > 45;

  # tips by parties of at least 5 diners OR bill total was more than $45
  tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]

another one...

  SELECT day, AVG(tip), COUNT(1) FROM tips GROUP BY day;

  tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
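For what it's worth, newer pandas (1.0+) can express that aggregation with named aggregation, which reads a bit closer to the SQL. A sketch with a made-up tips frame:

```python
import pandas as pd

# Hypothetical tiny version of the tips table.
tips = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat"],
    "tip": [1.0, 3.0, 2.0],
})

# SQL: SELECT day, AVG(tip), COUNT(1) FROM tips GROUP BY day;
summary = tips.groupby("day").agg(
    avg_tip=("tip", "mean"),
    n=("day", "size"),
)
```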

These are all very simple queries, and things can get very complicated, so this list of comparisons is hardly the last word. I've had to untangle SQL, but then again, I've had to untangle Python code as well. I do find the SQL expression of these two queries very clear when I read it, whereas I have to expend more mental effort on the pandas versions, especially the aggregation. And I think people who don't program would have an easier time reading the SQL as well.

Also, I find it valuable to be able to run queries directly against a database, or port them to a new environment. Pandas and R data frames are fairly similar, but if you've written your queries in SQL, the transfer is zero effort.

I also find that if you want to join several tables or more, select only a few columns, and do aggregations with HAVING clauses, perhaps with vertical stacking, the pandas code does get considerably more complicated. The SQL does too, and UNION queries are not pretty... but overall, I think the SQL certainly can come out ahead, and at times considerably ahead, in expressing clarity of thought and intent to someone reading the code later.
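HAVING is a good example: there's no single pandas keyword for it, so the usual translation is aggregate first, then filter the aggregate. A hedged sketch with a made-up frame:

```python
import pandas as pd

# Hypothetical tiny version of the tips table.
tips = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat"],
    "tip": [1.0, 3.0, 1.0],
})

# SQL: SELECT day, AVG(tip) AS avg_tip FROM tips
#      GROUP BY day HAVING AVG(tip) > 1.5;
agg = tips.groupby("day", as_index=False).agg(avg_tip=("tip", "mean"))
result = agg[agg["avg_tip"] > 1.5]
```

Two steps and an intermediate variable for what SQL says in one clause.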

Oh, one other thing: you did add a bit of code by selecting into a data frame. With pandasql, you can run SQL directly against a data frame and switch back and forth between pandas and SQL operations; you don't need the additional step of creating a new table and selecting into it. In this case, the code would look like

  df_merged = pysqldf("SELECT * FROM df1 JOIN df2 ON df1.common_key = df2.common_key")

This is really nice, since there are some data frame operations that are much simpler in SQL and others that are much simpler in pandas (actually, a lot of those aggregations would be handled through df.describe(), but the querying and subsetting to get to that df may be more succinctly expressed in SQL).
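If pandasql isn't handy, the same round trip can be sketched with the standard library's sqlite3 plus pandas' to_sql/read_sql (pandasql does something similar under the hood, defaulting to SQLite). The frames here are made up:

```python
import sqlite3

import pandas as pd

# Hypothetical frames sharing a key column.
df1 = pd.DataFrame({"common_key": [1, 2], "a": ["x", "y"]})
df2 = pd.DataFrame({"common_key": [1, 2], "b": [10, 20]})

# Copy the frames into an in-memory SQLite database,
# run the SQL there, and read the result back as a DataFrame.
con = sqlite3.connect(":memory:")
df1.to_sql("df1", con, index=False)
df2.to_sql("df2", con, index=False)
df_merged = pd.read_sql(
    "SELECT * FROM df1 JOIN df2 ON df1.common_key = df2.common_key", con
)
```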




