Hacker News new | past | comments | ask | show | jobs | submit login

I've personally grown to love SQL and I think it is by far the clearest (if verbose) way to describe data transformation work. I learned traditional programming (e.g. languages like Python) long before I stumbled on SQL, but I'm not sure I could've understood Pandas/R as well without having learned SQL, particularly the concepts of joins.

That said, my affection for SQL correlates with my overall growth as a software engineer, particularly grokking the concept of piping data between scripts: e.g. writing a SQL script to extract, filter, and transform data, and then a separate Python/R script that creates a visualization. I think SQL seems so unattractive to data scientists (i.e. people who may not be particularly focused on software development) because it seems so alien and separate from the frameworks (R/Pandas) they use to do the actual analysis and visualization. And so it's understandable to just learn how to do it all in R/Pandas, rather than a separate language and paradigm for data wrangling.

I think it might go deeper than that. I wear a lot of hats - part software engineer, part data engineer, part data scientist - and I find that the only perspective from which RDBMSes feel like the clear front-runner is when I'm wearing the software engineer hat.

When I'm wearing the data engineer hat, I start to get interested in some of the options that aren't even databases at all, such as storing the data in ORC files. Because they sometimes offer big performance advantages. The data intake pipelines I'm working on are batch-oriented, so they benefit little from how RDBMSes are optimized for random access to records, and maybe also ranges of records, but only at the cost of write performance if you're looking to filter on a natural key instead of a synthetic key. . . it gets thorny. ORC is less-than-stellar at selective ad-hoc queries, but I'm not doing selective ad-hoc queries against the internals of my data intake pipeline, and I will fight to the death to prevent the BI team from doing so.

When I'm wearing my data scientist hat, I've got no idea what my next query on the data is, so there's no indexing strategy that could possibly meet my needs. That kills one of the RDBMS's biggest value propositions. So let's just wing it with map-reduce and get on with life. And, TBH, SQL is just too declarative for the way my brain works when I'm in analyst mode - the semi-imperative workflow I get out of something like Spark SQL or R or Pandas feels more natural.

Have you considered snowflake or redshift? I've gone very far ahead with just simple SQL using these tools for a variety of DE and DS projects. Once even managed to get an ML model running on redshift.

If that is your jam, check out BigQuery' ML in SQL

I wish SQL had syntax more similar to dplyr[1] for complicated queries. I dislike having to do a nested sub-query every time I re-group the data - the query ends up being a deeply indented mess that needs to be read from the inside-out. Using "WITH" clauses helps a lot, but it's still not as easy to read as the more imperative dplyr style.

[1] https://dplyr.tidyverse.org/

Be wary of WITH statements as they are optimization barriers as some engines can't optimize them -- that's changing but just FYI. I believe the latest Postgres release is handling them properly now -- I could be wrong.

Good note. Subqueries are sometimes optimization barriers as well - it depends on query engine. If you care about performance you need to look at the query plan.

"If you care about performance you need to look at the query plan." Always I would say, once your data gets above toy size.

we use azure data explorer for a lot of data storage. It is a terribly boring name but it highly performant at a large range of data sizes, stores rectangular (table) data with clear schematization, and uses the piping type of syntax of tidyverse. I love it. I really wish they would rebrand and market it more.

SQL is indeed a powerful too. But it's just one of many. Just today I used: SQL to dump data to csv, python to modify it, and gRPCurl to upload it to another service. Different tools, different problems, use the right one.

That being said ... RDBMs still suffer from scale and there are many efforts to fix this with sharding to disaggregating the data layer from the query layer with examples like AWS's Aurora and GCP's BigQuery etc -- in all likelihood data will likely live in many different places and will need to be adapted into things like a datastore or queried from things like Hive and Presto, or even a traditional ETL + EDW setup -- my point being there's no one thing that will solve everything as most single solutions break down when the number of rows is measured in the billions or more.

> SQL to dump data

Is this another way to say you use SQL to query specific data from a larger dataset?

I imagine they mean SQL to unload from whatever SQL storage engine out to blob(s3,azure,etc), HDFS or local-file system.

COPY INTO s3://mybucket/foobar/ from (select a, b, c from ...);

Yeah sorry. I used the cli front end to my database from a container and had it dump the results to a file that I then massaged with python

> I've personally grown to love SQL and I think it is by far the clearest (if verbose) way to describe data transformation work.

I'm going to go against the grain here, at least as expressed in this thread, and say something that I think is defensible, but...

SQL is a terrible language.

1. It's old. There have been many, many advancements in programming languages since its inception in the mid-80s, none of which have made it into the language itself.

2. SQL is naturally tedious, with few mechanisms for isolating common functionality into e.g. objects/interfaces/functions/monads. Stored procedures exist, but still suffer from the other shortcomings of the language.

3. There's no vendor-agnostic way to write unit tests.

4. The code itself is ugly. There isn't a standardized (or even semi-standardized) way to confidently format code, compared to languages like Python or Java. I've seen (and very occasionally written) code in languages like Ruby or Java that is elegant to the point of being artistic. SQL is almost universally ugly.

5. Over time SQL tends towards becoming unmaintainably complex, for reasons I've come to believe are endemic to the language itself. In my experience and almost without exception, SQL has been the source of the most complex and difficult to maintain/test code in every codebase I've worked with. Views that are 300+ lines long, with multiple layers of inner joins, are common. And no one wants to touch these monsters because it's almost inevitable that refactoring will break something way over yonder.

6. There's no type checking.

7. There's no standardized library sharing framework like RubyGems or Python's pip.

8. IDE support is limited, and is mainly limited to basic code formatting. Compare this to something like IntelliJ where you can just ctrl-click on a method or variable and go to the definition. Even TypeScript is better here.

9. Refactoring is dangerous, and the ability for your tools to lend a hand is limited. Want to rename a public method in Java? Right click. You can be pretty confident doing so won't break your app (with caveats). If you want to rename a column in SQL, you're not going to have nearly that amount of confidence.

Now, I think there is a place for something like SQL: fast, able to easily access and update large amounts of data, etc. Something that executes directly against an RDBMS.

I just wish it wasn't SQL.

I hate point-by-point rebuttals, but here goes mine anyway ;)

> 1. It's old. There have been many, many advancements in programming languages since its inception in the mid-80s, none of which have made it into the language itself.

It's even older than that (around 1979), but being old isn't an argument in itself.

> 2. objects/interfaces/functions/monads, Stored procedures

Not the job of a database

> 3. There's no vendor-agnostic way to write unit tests.

As compared to alternative NoSQL database code that isn't portable at all? Agree though that the testing culture in SQL land could be improved.

> 4. The code itself is ugly.

Entirely a subjective matter. Programming != poetry.

> 5. Over time SQL tends towards becoming unmaintainably complex

If queries are complex in a language that is already very compact compared to equivalent procedural code, then chances are they're complex because the business problem is irreducibly complex, and another langauge won't save you here.

> 6. There's no type checking.

Of course there is! You can't insert character data into number columns for example (SQLite handles this a bit different though), and will receive proper type-related error message on eg. date functions.

> 7. There's no standardized library sharing framework like RubyGems or Python's pip.

What exactly would a stdlib (of table definitions?) be useful for? The standard is the SQL language itself. Maybe the state of portable SQL scripting could be improved by eg. lifting the syntax of eg. Oracle SQLPlus, also implemented by DB/2 since a couple years, or another syntax into the ISO SQL standard.

> 8. IDE support is limited, and is mainly limited to basic code formatting.

Last I checked, IDE support for SQL was quite good on Eclipse, including autocompletion. When you assemble SQL from strings in your app, there's a limit to what an IDE can do.

> 9. Refactoring is dangerous

SQL/relational algebra has a very powerful construct known as "views" - a consistent relational interface for your apps to work against. Refactoring is as good or bad as you make it to be.

I think SQL works well for its problem domain and agree with everything you said.

> It's even older than that (around 1979), but being old isn't an argument in itself.

I wanted to add an additional comment on this point. What is it with old automatically equalling bad? I definitely appreciate where improvements can be made, but I think as an industry we incorrectly think there is something wrong with an old/mature technology. Throwing things away just because they are old hinders progress since we're constantly rewriting tested, working code. I'll get off my soapbox now.

> What is it with old automatically equalling bad?

That's not what was said. What was said that it's old and that it hasn't benefited from decades of research on languages and the way programmers are most productive.

> Now, I think there is a place for something like SQL: fast, able to easily access and update large amounts of data, etc. Something that executes directly against an RDBMS.

Yes, I don't necessarily disagree with anything you said. I didn't mean to imply that SQL was ideal for very good beyond a fairly limited set of functionality, which I would loosely describe as: a system for which many stored operations and refactoring becomes a necessity.

The use case I have in mind is with wrangling multiple, disparate datasets, into a normalized format that can be piped into a visualization function, or into a Rails app (for which an ORM like ActiveRecord can suffice going forward). And also, I'm thinking of people relatively new to data in general. I find SQL as a language offers a more explicit and rigid way of communicating data transformation concepts. After that, for most people who end up learning pandas/R, it's generally easier to do all the work in those frameworks (especially if you like notebooks).

I also didn't like SQL before, but I've come to realize it's mainly because there's a boundary/interface between set-based modeling and imperative/functional based modeling. The former models things as records and relations, and the latter two models things as data structures. And it's this mismatch that ORMs were suppose to solve, but it's been a quagmire, and often called the Vietnam of Computer Science.

However, I've come to appreciate it, and it's pretty great for what it gets you. Databases use to be very data structure specific. So if you wanted to migrate to a different database, you'd have to map the data from one db's data structure implementation to another!

Queries also use to be path dependent! That means you could only access data through a specific object, like json or XML. This can be problematic if you haven't fully figured out the query patterns. With relational, you can be flexible with queries, and don't have to know them ahead of time.

Lastly, when you think about all the imperative/functional code you'd have to write to execute a query with joins, it's pretty great that you can just express it declaratively.

As for some of your reasons, I'll just pick a few of my reactions: 5) a 300+ line program is small compared to most imperative code bases out there. But it is dense, I concede. 6) type checking is enforced in the schema 9) This seems to be property of code that works with data. Unless the data has a way of carrying versions and schema changes with itself, I'm not sure how downstream code of any paradigm would know something has changed.

Now, I think SQL could be improved, just not for the reasons you've given. Just because a language is old, doesn't mean it doesn't have good idea. Lisp is one of the oldest languages out there, and lots of languages have only recently caught up to its features. http://paulgraham.com/diff.html

I wish the query was written from big to small, where you pick tables first, join them, and then select your projection. It seems more natural to think of narrowing down your query.

I wish you didn't have to explicitly state the foreign keys and add indicies, and it'd figure it out for you, and only ask in ambiguous cases.

I wish there was a way to modularize it and make parts of queries reuseable--though I know there's a WITH expression.

I wish you didn't need to specify the tables and how they're related in the queries, and just ask for the data by columns.

I have my complaints about some of the details of the SQL langauge and syntax as well but I don't understand this desire to import features of iterative and functional programming languages into SQL. These are totally different domains and these features don't really cross-over. SQL is a tool for defining what data to pull from a database whereas those features are for more easily decomposing an iterative (or functional) process. Trying to break a complex queries into simpler sub-components (and not have those sub-components act as an optimization barrier) seems like an entirely different kind of problem for which entirely different tools would be required, even if that's not obvious from the outset.

SQL is to data as DRAM is to compute. It's the thing most-fit for the job. We're unlikely to invent some new way to do those tasks, perhaps refine them, but fundamentally, they are the best-fit.

SQL wasn't so much invented as it was discovered.

I've found that people who dislike SQL really just tend to struggle with set-based logic and thinking (not saying that's the case for you). It's an absolutely beautiful and powerful language.

Powerful, absolutely yes. Beautiful? Absolutely not. This is of course subjective, but come on. Insofar as you can be objective about beautiful code, SQL stands over in the corner wishing it had remembered to shower while all the other kids are elegantly rotating on the dance floor.

Except PHP. It's hiding in the bathroom.

Just curious if you've looked at Microsoft's LINQ language? It addresses many of the pain points you've mentioned, and would be (in my mind) a good model for the next iteration of the SQL standard.

Yeah, although it's been a while. Something very much like that is what I have in mind, though. My biggest complaint with LINQ is that it appears to be more-or-less another ORM: you're still dealing with objects, instead of datasets. (I never actually used it, though, so could very well be completely off base here.)

But yeah, that is at the very least better than SQL.

I think of SQL like the JSON format: a standard that became standard because it's so boringly simple for software to write and parse. You can easily pick up enough in a day to pass queries from other languages, grab the data, and go back to your language of choice. SQL does its job and leaves features and complexity for other tools.

You do not seem to understand SQL

Joins for pd.DataFrame.merge() and SQL are explained in exactly the same way, it doesn't matter which order you learn them in. I learned them first in pandas and found the syntax to be more intuitive there.

I'm quite frankly stunned to be reading this. In my experience in industry it's almost universal that pandas syntax is overall pretty horrendous.

SQL is light years more intuitive in my eyes.

  df_merged = df1.merge(df2, on="common_key")

  SELECT * INTO df_merged
    FROM df1
    JOIN df2
    ON df1.common_key = df1.common_key)
Python and its mature libraries are vastly more concise than most alternatives, SQL included. Especially when you get to beginner Pandas vs. beginner SQL. The latter is an absolute horror show.

You don’t need the “INTO df_merged” or entire second line, right?

Absolutely disagree. Pandas is the horror show.

create table df_merged as (select from df1 join df2 using(common_key))

Perhaps a bit more verbose but infinitely more readable and parseable.

How about this one. Add a between join i.e. "AND where df1.datecol between df2.datecol - interval '30 day' and df2.datecol + interval '30' day" in pandas?

This is an extremely common operation in analytics, by no means an esoteric use case.

This comparison between pandas and SQL (from the pandas site) is a good reference:


Here's one example:

-- tips by parties of at least 5 diners OR bill total was more than $45 SELECT * FROM tips WHERE size >= 5 OR total_bill > 45;

# tips by parties of at least 5 diners OR bill total was more than $45 In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]

another one...

SELECT day, AVG(tip), COUNT(1) FROM tips GROUP BY day;

tips.groupby('day').agg({'tip': np.mean, 'day': np.size})

These are all very simple queries, and things can get very complicated, so this list of comparisons is hardly the last word. I've had to untangle sql, but then again, I've had to untangle python code as well. I do find the SQL expression of these two queries very clear when I read it, whereas I have to expend more mental effort on the others, especially the aggregation. And I think people who don't program would have an easier time reading the SQL as well.

Also, I find it is valuable to be able to run queries directly against a database, or port them to a new environment. Pandas and R data frames are fairly similar, but if you've written them in sql, the transfer is zero effort.

I also find that if you want to join several tables or more, select only a few columns, and do aggregations with HAVING clauses, with vertical stacking perhaps, the pandas code does get considerably more complicated. The SQL does too, and UNION queries are not pretty... but overall, I think the SQL code certainly can come out ahead, and times considerably ahead, in terms of expressing clarity of thought and intent to someone reading the code later.

Oh one other thing you did add a bit of code by selecting into a data frame. With pandasql, you can run sql directly against a data frame and switch back and forth between pandas and sql operations, you don't need the additional step of creating and selecting into a new table. In this case, the code would look like

df_merged = pysqldf("SELECT * FROM df1 JOIN df2 ON df1.common_key = df1.common_key")

This is really nice, since there are some data frame operations that are much simpler in SQL and others that are much simpler in Pandas (actually, a lot of those aggregations would be handled through df.describe() -- but the querying and subsetting getting to that df may be more succinctly expressed in SQL).

I also love it. Not even sure what the argument is here against it.

I love SQL as well, but SQL is too low level for "data scientists" that are used to copy paste scripts from the internet and working with pre-baked packages.

Admittedly, there are data scientists in roles that consist of hacking together opague scripts until something comes out. But I do think people misperceive SQL as "low level" compared to R/Python, when ironically, SQL is actually high level, in terms of a programming language.

> grokking

Hey they brought me a smile this morning thank you ( I love Heinlein ).

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact