
This comparison between pandas and SQL (from the pandas site) is a good reference:

https://pandas.pydata.org/pandas-docs/stable/getting_started...

Here's one example:

-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT * FROM tips WHERE size >= 5 OR total_bill > 45;

# tips by parties of at least 5 diners OR bill total was more than $45
tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]

Another one:

SELECT day, AVG(tip), COUNT(1) FROM tips GROUP BY day;

tips.groupby('day').agg({'tip': np.mean, 'day': np.size})

These are all very simple queries, and real ones can get much more complicated, so this list of comparisons is hardly the last word. I've had to untangle SQL, but then again, I've had to untangle Python code as well. Still, I find the SQL version of these two queries very clear when I read it, whereas I have to expend more mental effort on the pandas versions, especially the aggregation. And I think people who don't program would have an easier time reading the SQL as well.
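For what it's worth, newer pandas versions support "named aggregation", which reads a bit closer to the SQL than the dict form above. A minimal sketch, assuming a tiny made-up tips frame (the rows and the output column names avg_tip and n are mine, not from the pandas docs):

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset; the values are made up.
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun"],
    "tip": [2.0, 4.0, 3.0, 5.0, 1.0],
})

# Rough equivalent of:
#   SELECT day, AVG(tip) AS avg_tip, COUNT(1) AS n FROM tips GROUP BY day;
by_day = (
    tips.groupby("day")
        .agg(avg_tip=("tip", "mean"), n=("tip", "size"))
        .reset_index()
)
print(by_day)
```

Each keyword names an output column and pairs it with a source column and an aggregation, so at least the intent is spelled out. I'd still say the SQL is easier to read cold.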

Also, I find it valuable to be able to run queries directly against a database, or port them to a new environment. Pandas and R data frames are fairly similar, but moving between them still means rewriting code; if you've written the queries in SQL, the transfer is essentially zero effort.
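To make the "run it directly against a database" point concrete, here's a rough sketch that runs the same two queries against SQLite using only Python's built-in sqlite3 module. The table contents are made up, and I added ORDER BY clauses so the row order is deterministic; otherwise the SQL is unchanged.

```python
import sqlite3

# In-memory database with a hypothetical tips table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tips (total_bill REAL, tip REAL, day TEXT, size INTEGER)"
)
conn.executemany(
    "INSERT INTO tips VALUES (?, ?, ?, ?)",
    [
        (50.0, 8.0, "Sat", 4),
        (20.0, 3.0, "Sat", 2),
        (30.0, 5.0, "Sun", 6),
        (10.0, 1.0, "Sun", 2),
    ],
)

# Parties of at least 5 diners OR bill total above $45.
big_tables = conn.execute(
    "SELECT * FROM tips WHERE size >= 5 OR total_bill > 45 ORDER BY total_bill"
).fetchall()

# Average tip and row count per day.
by_day = conn.execute(
    "SELECT day, AVG(tip), COUNT(1) FROM tips GROUP BY day ORDER BY day"
).fetchall()

print(big_tables)
print(by_day)
```

The query strings should move to another SQL engine mostly as-is, though dialect differences can still bite on less trivial queries.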

I also find that if you want to join several tables, select only a few columns, and do aggregations with HAVING clauses, perhaps with vertical stacking, the pandas code gets considerably more complicated. The SQL does too, and UNION queries are not pretty... but overall, I think the SQL can come out ahead, at times considerably ahead, in expressing clarity of thought and intent to someone reading the code later.
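As a small illustration of the HAVING case, here's a sketch that computes the same thing both ways: average tip per day, keeping only days with at least two rows. The data is made up, and the SQL side goes through an in-memory SQLite database via pandas' to_sql and read_sql_query, which is just one way to set up the comparison.

```python
import sqlite3

import pandas as pd

# Hypothetical tips data; the values are made up.
tips = pd.DataFrame({
    "day": ["Fri", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "tip": [1.0, 2.0, 4.0, 3.0, 5.0, 1.0],
})

# SQL version: the filter on the group size lives in the HAVING clause.
conn = sqlite3.connect(":memory:")
tips.to_sql("tips", conn, index=False)
sql_result = pd.read_sql_query(
    "SELECT day, AVG(tip) AS avg_tip FROM tips "
    "GROUP BY day HAVING COUNT(*) >= 2 ORDER BY day",
    conn,
)

# pandas version: aggregate first, then filter on the aggregated count.
agg = tips.groupby("day").agg(avg_tip=("tip", "mean"), n=("tip", "size"))
pandas_result = agg.loc[agg["n"] >= 2, ["avg_tip"]].reset_index()

print(sql_result)
print(pandas_result)
```

Neither is long here, but the pandas version needs a helper column (n) that exists only to be filtered on and then dropped, and that kind of bookkeeping grows once joins and stacking enter the picture.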

Oh, one other thing: you did add a bit of code by selecting into a data frame. With pandasql, you can run SQL directly against a data frame and switch back and forth between pandas and SQL operations; you don't need the additional step of creating and selecting into a new table. In this case (with pysqldf being the usual small wrapper around pandasql's sqldf), the code would look like:

df_merged = pysqldf("SELECT * FROM df1 JOIN df2 ON df1.common_key = df2.common_key")

This is really nice, since there are some data frame operations that are much simpler in SQL and others that are much simpler in pandas (actually, a lot of those aggregations would be handled through df.describe(), but the querying and subsetting needed to get to that df may be more succinctly expressed in SQL).





