Modern Data Practice and the SQL Tradition (tselai.com)
286 points by MarkusWinand on Nov 8, 2019 | hide | past | favorite | 221 comments

I've personally grown to love SQL and I think it is by far the clearest (if verbose) way to describe data transformation work. I learned traditional programming (e.g. languages like Python) long before I stumbled on SQL, but I'm not sure I could've understood Pandas/R as well without having learned SQL, particularly the concepts of joins.

That said, my affection for SQL correlates with my overall growth as a software engineer, particularly grokking the concept of piping data between scripts: e.g. writing a SQL script to extract, filter, and transform data, and then a separate Python/R script that creates a visualization. I think SQL seems so unattractive to data scientists (i.e. people who may not be particularly focused on software development) because it seems so alien and separate from the frameworks (R/Pandas) they use to do the actual analysis and visualization. And so it's understandable to just learn how to do it all in R/Pandas, rather than a separate language and paradigm for data wrangling.

I think it might go deeper than that. I wear a lot of hats - part software engineer, part data engineer, part data scientist - and I find that the only perspective from which RDBMSes feel like the clear front-runner is when I'm wearing the software engineer hat.

When I'm wearing the data engineer hat, I start to get interested in some of the options that aren't even databases at all, such as storing the data in ORC files. Because they sometimes offer big performance advantages. The data intake pipelines I'm working on are batch-oriented, so they benefit little from how RDBMSes are optimized for random access to records, and maybe also ranges of records, but only at the cost of write performance if you're looking to filter on a natural key instead of a synthetic key. . . it gets thorny. ORC is less-than-stellar at selective ad-hoc queries, but I'm not doing selective ad-hoc queries against the internals of my data intake pipeline, and I will fight to the death to prevent the BI team from doing so.

When I'm wearing my data scientist hat, I've got no idea what my next query on the data is, so there's no indexing strategy that could possibly meet my needs. That kills one of the RDBMS's biggest value propositions. So let's just wing it with map-reduce and get on with life. And, TBH, SQL is just too declarative for the way my brain works when I'm in analyst mode - the semi-imperative workflow I get out of something like Spark SQL or R or Pandas feels more natural.

Have you considered snowflake or redshift? I've gone very far ahead with just simple SQL using these tools for a variety of DE and DS projects. Once even managed to get an ML model running on redshift.

If that is your jam, check out BigQuery's ML in SQL

I wish SQL had syntax more similar to dplyr[1] for complicated queries. I dislike having to do a nested sub-query every time I re-group the data - the query ends up being a deeply indented mess that needs to be read from the inside-out. Using "WITH" clauses helps a lot, but it's still not as easy to read as the more imperative dplyr style.

[1] https://dplyr.tidyverse.org/
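To make the inside-out vs. top-down difference concrete, here is a minimal sketch in Python against SQLite (the table and column names are invented for illustration) showing the same re-grouping written both ways:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (region TEXT, amount REAL);
    INSERT INTO orders VALUES ('east', 10), ('east', 30), ('west', 5);
""")

# Nested style: the re-grouping has to be read from the inside out.
nested = con.execute("""
    SELECT MAX(total)
    FROM (SELECT region, SUM(amount) AS total FROM orders GROUP BY region)
""").fetchone()

# WITH style: the same query reads top-down, closer to a dplyr pipeline.
cte = con.execute("""
    WITH totals AS (
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region
    )
    SELECT MAX(total) FROM totals
""").fetchone()
```

Both forms return the same result; the difference is purely in how the reader traverses the query.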

Be wary of WITH clauses, as some engines treat them as optimization barriers and can't optimize across them -- that's changing, but just FYI. I believe the latest Postgres release handles them properly now -- I could be wrong.

Good note. Subqueries are sometimes optimization barriers as well - it depends on query engine. If you care about performance you need to look at the query plan.

"If you care about performance you need to look at the query plan." Always I would say, once your data gets above toy size.
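In SQLite, for example, the plan is one query away; the exact plan text varies by engine and version, but the idea is the same everywhere (names below are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (user_id INTEGER, payload TEXT);
    CREATE INDEX idx_events_user ON events (user_id);
""")

# EXPLAIN QUERY PLAN is SQLite's (terse) equivalent of EXPLAIN elsewhere.
# Each row's last column is a human-readable description of one plan step.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = ?", (42,)
).fetchall()
```

Here the plan should report a SEARCH using `idx_events_user` rather than a full table scan, which is exactly the kind of thing worth confirming once data gets above toy size.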

we use azure data explorer for a lot of data storage. It has a terribly boring name, but it is highly performant across a large range of data sizes, stores rectangular (table) data with clear schematization, and uses a piping syntax like the tidyverse's. I love it. I really wish they would rebrand and market it more.

SQL is indeed a powerful tool. But it's just one of many. Just today I used: SQL to dump data to csv, python to modify it, and gRPCurl to upload it to another service. Different tools, different problems, use the right one.

That being said ... RDBMSs still suffer at scale, and there are many efforts to fix this, ranging from sharding to disaggregating the data layer from the query layer, with examples like AWS's Aurora and GCP's BigQuery. In all likelihood data will live in many different places and will need to be adapted into things like a datastore, queried from things like Hive and Presto, or handled by a traditional ETL + EDW setup. My point being there's no one thing that will solve everything, as most single solutions break down when the number of rows is measured in the billions or more.

> SQL to dump data

Is this another way to say you use SQL to query specific data from a larger dataset?

I imagine they mean SQL to unload from whatever SQL storage engine out to blob(s3,azure,etc), HDFS or local-file system.

COPY INTO s3://mybucket/foobar/ from (select a, b, c from ...);

Yeah sorry. I used the cli front end to my database from a container and had it dump the results to a file that I then massaged with python
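That dump-then-massage workflow is easy to sketch in one script. The example below (table, file, and column names are all invented) uses SQLite in place of whatever the actual database was, but the shape is the same: SQL extracts, Python transforms:

```python
import csv
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (name TEXT, email TEXT);
    INSERT INTO users VALUES ('Ada', 'ADA@EXAMPLE.COM'), ('Bob', 'bob@example.com');
""")

# Step 1: SQL does the extraction and filtering (a CLI front end
# dumping to a file accomplishes the same thing).
rows = con.execute("SELECT name, email FROM users ORDER BY name").fetchall()
with open("dump.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "email"])
    writer.writerows(rows)

# Step 2: Python massages the dump before handing it to the next tool.
with open("dump.csv", newline="") as f:
    cleaned = [{**row, "email": row["email"].lower()}
               for row in csv.DictReader(f)]
```

Each stage stays small and replaceable, which is the appeal of piping data between tools.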

> I've personally grown to love SQL and I think it is by far the clearest (if verbose) way to describe data transformation work.

I'm going to go against the grain here, at least as expressed in this thread, and say something that I think is defensible, but...

SQL is a terrible language.

1. It's old. There have been many, many advancements in programming languages since its inception in the mid-80s, none of which have made it into the language itself.

2. SQL is naturally tedious, with few mechanisms for isolating common functionality into e.g. objects/interfaces/functions/monads. Stored procedures exist, but still suffer from the other shortcomings of the language.

3. There's no vendor-agnostic way to write unit tests.

4. The code itself is ugly. There isn't a standardized (or even semi-standardized) way to confidently format code, compared to languages like Python or Java. I've seen (and very occasionally written) code in languages like Ruby or Java that is elegant to the point of being artistic. SQL is almost universally ugly.

5. Over time SQL tends towards becoming unmaintainably complex, for reasons I've come to believe are endemic to the language itself. In my experience and almost without exception, SQL has been the source of the most complex and difficult to maintain/test code in every codebase I've worked with. Views that are 300+ lines long, with multiple layers of inner joins, are common. And no one wants to touch these monsters because it's almost inevitable that refactoring will break something way over yonder.

6. There's no type checking.

7. There's no standardized library sharing framework like RubyGems or Python's pip.

8. IDE support is limited, and is mainly limited to basic code formatting. Compare this to something like IntelliJ where you can just ctrl-click on a method or variable and go to the definition. Even TypeScript is better here.

9. Refactoring is dangerous, and the ability for your tools to lend a hand is limited. Want to rename a public method in Java? Right click. You can be pretty confident doing so won't break your app (with caveats). If you want to rename a column in SQL, you're not going to have nearly that amount of confidence.

Now, I think there is a place for something like SQL: fast, able to easily access and update large amounts of data, etc. Something that executes directly against an RDBMS.

I just wish it wasn't SQL.

I hate point-by-point rebuttals, but here goes mine anyway ;)

> 1. It's old. There have been many, many advancements in programming languages since its inception in the mid-80s, none of which have made it into the language itself.

It's even older than that (around 1979), but being old isn't an argument in itself.

> 2. objects/interfaces/functions/monads, Stored procedures

Not the job of a database

> 3. There's no vendor-agnostic way to write unit tests.

As compared to alternative NoSQL database code that isn't portable at all? Agree though that the testing culture in SQL land could be improved.

> 4. The code itself is ugly.

Entirely a subjective matter. Programming != poetry.

> 5. Over time SQL tends towards becoming unmaintainably complex

If queries are complex in a language that is already very compact compared to equivalent procedural code, then chances are they're complex because the business problem is irreducibly complex, and another language won't save you here.

> 6. There's no type checking.

Of course there is! You can't insert character data into number columns, for example (SQLite handles this a bit differently), and you will receive a proper type-related error message from e.g. date functions.

> 7. There's no standardized library sharing framework like RubyGems or Python's pip.

What exactly would a stdlib (of table definitions?) be useful for? The standard is the SQL language itself. Maybe the state of portable SQL scripting could be improved by lifting the syntax of e.g. Oracle SQL*Plus, also implemented by DB2 for a couple of years now, or another syntax into the ISO SQL standard.

> 8. IDE support is limited, and is mainly limited to basic code formatting.

Last I checked, IDE support for SQL was quite good on Eclipse, including autocompletion. When you assemble SQL from strings in your app, there's a limit to what an IDE can do.

> 9. Refactoring is dangerous

SQL/relational algebra has a very powerful construct known as "views" - a consistent relational interface for your apps to work against. Refactoring is as good or bad as you make it to be.
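The view-as-interface idea is easy to demonstrate. In the sketch below (SQLite via Python; all names invented), the base table has been renamed and restructured, but application queries keep working because they only ever touch the view:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- The base table was refactored: new name, new column names.
    CREATE TABLE customers_v2 (cust_id INTEGER, full_name TEXT);
    INSERT INTO customers_v2 VALUES (1, 'Ada Lovelace');

    -- The view is the contract: apps query it, not the base table.
    CREATE VIEW customers AS
        SELECT cust_id AS id, full_name AS name FROM customers_v2;
""")

# Application code still uses the old names, untouched by the refactor.
row = con.execute("SELECT id, name FROM customers").fetchone()
```

Renaming a column then becomes a change to one view definition rather than a hunt through every consumer.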

I think SQL works well for its problem domain and agree with everything you said.

> It's even older than that (around 1979), but being old isn't an argument in itself.

I wanted to add an additional comment on this point. What is it with old automatically equalling bad? I definitely appreciate where improvements can be made, but I think as an industry we incorrectly think there is something wrong with an old/mature technology. Throwing things away just because they are old hinders progress since we're constantly rewriting tested, working code. I'll get off my soapbox now.

> What is it with old automatically equalling bad?

That's not what was said. What was said was that it's old and that it hasn't benefited from decades of research on languages and the way programmers are most productive.

I have my complaints about some of the details of the SQL language and syntax as well, but I don't understand this desire to import features of iterative and functional programming languages into SQL. These are totally different domains and these features don't really cross over. SQL is a tool for defining what data to pull from a database, whereas those features are for more easily decomposing an iterative (or functional) process. Trying to break complex queries into simpler sub-components (and not have those sub-components act as an optimization barrier) seems like an entirely different kind of problem for which entirely different tools would be required, even if that's not obvious from the outset.

I also didn't like SQL before, but I've come to realize it's mainly because there's a boundary/interface between set-based modeling and imperative/functional modeling. The former models things as records and relations, and the latter two model things as data structures. And it's this mismatch that ORMs were supposed to solve, but it's been a quagmire, often called the Vietnam of Computer Science.

However, I've come to appreciate it, and it's pretty great for what it gets you. Databases used to be very data-structure specific. So if you wanted to migrate to a different database, you'd have to map the data from one db's data structure implementation to another!

Queries also used to be path-dependent! That means you could only access data through a specific object, as in JSON or XML. This can be problematic if you haven't fully figured out the query patterns. With relational, you can be flexible with queries, and don't have to know them ahead of time.

Lastly, when you think about all the imperative/functional code you'd have to write to execute a query with joins, it's pretty great that you can just express it declaratively.

As for some of your reasons, I'll just pick a few of my reactions: 5) a 300+ line program is small compared to most imperative code bases out there. But it is dense, I concede. 6) Type checking is enforced in the schema. 9) This seems to be a property of code that works with data. Unless the data has a way of carrying versions and schema changes with itself, I'm not sure how downstream code of any paradigm would know something has changed.

Now, I think SQL could be improved, just not for the reasons you've given. Just because a language is old doesn't mean it doesn't have good ideas. Lisp is one of the oldest languages out there, and lots of languages have only recently caught up to its features. http://paulgraham.com/diff.html

I wish the query was written from big to small, where you pick tables first, join them, and then select your projection. It seems more natural to think of narrowing down your query.

I wish you didn't have to explicitly state the foreign keys and add indices, and it'd figure it out for you, and only ask in ambiguous cases.

I wish there was a way to modularize it and make parts of queries reusable--though I know there's a WITH expression.

I wish you didn't need to specify the tables and how they're related in the queries, and just ask for the data by columns.

SQL is to data as DRAM is to compute. It's the thing most-fit for the job. We're unlikely to invent some new way to do those tasks, perhaps refine them, but fundamentally, they are the best-fit.

SQL wasn't so much invented as it was discovered.

> Now, I think there is a place for something like SQL: fast, able to easily access and update large amounts of data, etc. Something that executes directly against an RDBMS.

Yes, I don't necessarily disagree with anything you said. I didn't mean to imply that SQL was ideal for, or very good beyond, a fairly limited set of functionality, which I would loosely describe as: a system for which many stored operations and refactoring become a necessity.

The use case I have in mind is with wrangling multiple, disparate datasets, into a normalized format that can be piped into a visualization function, or into a Rails app (for which an ORM like ActiveRecord can suffice going forward). And also, I'm thinking of people relatively new to data in general. I find SQL as a language offers a more explicit and rigid way of communicating data transformation concepts. After that, for most people who end up learning pandas/R, it's generally easier to do all the work in those frameworks (especially if you like notebooks).

I've found that people who dislike SQL really just tend to struggle with set-based logic and thinking (not saying that's the case for you). It's an absolutely beautiful and powerful language.

Powerful, absolutely yes. Beautiful? Absolutely not. This is of course subjective, but come on. Insofar as you can be objective about beautiful code, SQL stands over in the corner wishing it had remembered to shower while all the other kids are elegantly rotating on the dance floor.

Except PHP. It's hiding in the bathroom.

Just curious if you've looked at Microsoft's LINQ language? It addresses many of the pain points you've mentioned, and would be (in my mind) a good model for the next iteration of the SQL standard.

Yeah, although it's been a while. Something very much like that is what I have in mind, though. My biggest complaint with LINQ is that it appears to be more-or-less another ORM: you're still dealing with objects, instead of datasets. (I never actually used it, though, so could very well be completely off base here.)

But yeah, that is at the very least better than SQL.

I think of SQL like the JSON format: a standard that became standard because it's so boringly simple for software to write and parse. You can easily pick up enough in a day to pass queries from other languages, grab the data, and go back to your language of choice. SQL does its job and leaves features and complexity for other tools.

You do not seem to understand SQL

Joins for pd.DataFrame.merge() and SQL are explained in exactly the same way, it doesn't matter which order you learn them in. I learned them first in pandas and found the syntax to be more intuitive there.

I'm quite frankly stunned to be reading this. In my experience in industry it's almost universal that pandas syntax is overall pretty horrendous.

SQL is light years more intuitive in my eyes.

  df_merged = df1.merge(df2, on="common_key")

  SELECT * INTO df_merged
    FROM df1
    JOIN df2
    ON df1.common_key = df2.common_key
Python and its mature libraries are vastly more concise than most alternatives, SQL included. Especially when you get to beginner Pandas vs. beginner SQL. The latter is an absolute horror show.

Absolutely disagree. Pandas is the horror show.

create table df_merged as (select * from df1 join df2 using (common_key))

Perhaps a bit more verbose but infinitely more readable and parseable.

How about this one. Add a between join, i.e. "AND df1.datecol BETWEEN df2.datecol - interval '30 days' AND df2.datecol + interval '30 days'", in pandas?

This is an extremely common operation in analytics, by no means an esoteric use case.
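For what it's worth, since pandas has no native range join, the usual workaround is to merge on the equality key first and then filter down to the window. A minimal sketch (column names invented):

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "a"],
                    "datecol": pd.to_datetime(["2019-01-10", "2019-06-01"])})
df2 = pd.DataFrame({"key": ["a"],
                    "datecol": pd.to_datetime(["2019-01-20"])})

# Merge on the equality key, then filter to the +/- 30 day window.
# This materializes the full equi-join first, which can blow up
# memory on skewed keys -- something the SQL version avoids.
merged = df1.merge(df2, on="key", suffixes=("_1", "_2"))
window = pd.Timedelta(days=30)
result = merged[merged["datecol_1"].between(merged["datecol_2"] - window,
                                            merged["datecol_2"] + window)]
```

It works, but it is noticeably less direct than the SQL BETWEEN predicate.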

You don’t need the “INTO df_merged” or entire second line, right?

This comparison between pandas and SQL (from the pandas site) is a good reference:


Here's one example:

  -- tips by parties of at least 5 diners OR bill total was more than $45
  SELECT * FROM tips WHERE size >= 5 OR total_bill > 45;

  # tips by parties of at least 5 diners OR bill total was more than $45
  In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]

another one...

  SELECT day, AVG(tip), COUNT(1) FROM tips GROUP BY day;

  tips.groupby('day').agg({'tip': np.mean, 'day': np.size})

These are all very simple queries, and things can get very complicated, so this list of comparisons is hardly the last word. I've had to untangle sql, but then again, I've had to untangle python code as well. I do find the SQL expression of these two queries very clear when I read it, whereas I have to expend more mental effort on the others, especially the aggregation. And I think people who don't program would have an easier time reading the SQL as well.

Also, I find it is valuable to be able to run queries directly against a database, or port them to a new environment. Pandas and R data frames are fairly similar, but if you've written them in sql, the transfer is zero effort.

I also find that if you want to join several tables or more, select only a few columns, and do aggregations with HAVING clauses, with vertical stacking perhaps, the pandas code does get considerably more complicated. The SQL does too, and UNION queries are not pretty... but overall, I think the SQL code certainly can come out ahead, and at times considerably ahead, in terms of expressing clarity of thought and intent to someone reading the code later.

Oh, one other thing: you did add a bit of code by selecting into a new table. With pandasql, you can run SQL directly against a data frame and switch back and forth between pandas and SQL operations; you don't need the additional step of creating and selecting into a new table. In this case, the code would look like

df_merged = pysqldf("SELECT * FROM df1 JOIN df2 ON df1.common_key = df2.common_key")

This is really nice, since there are some data frame operations that are much simpler in SQL and others that are much simpler in Pandas (actually, a lot of those aggregations would be handled through df.describe() -- but the querying and subsetting getting to that df may be more succinctly expressed in SQL).
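As I understand it, pandasql does roughly the following under the hood: copy the frames into SQLite, run the query, and return a new DataFrame. You can do the same with just pandas and the standard library (frame and column names here are invented):

```python
import sqlite3

import pandas as pd

df1 = pd.DataFrame({"common_key": [1, 2], "a": ["x", "y"]})
df2 = pd.DataFrame({"common_key": [1, 2], "b": ["p", "q"]})

# Copy the frames into an in-memory SQLite database...
con = sqlite3.connect(":memory:")
df1.to_sql("df1", con, index=False)
df2.to_sql("df2", con, index=False)

# ...then run SQL against them and get a DataFrame back.
df_merged = pd.read_sql(
    "SELECT * FROM df1 JOIN df2 USING (common_key)", con)
```

This loses pandasql's convenience of picking frames up from the local namespace automatically, but it shows there is no magic involved in mixing the two styles.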

I also love it. Not even sure what the argument is here against it.

I love SQL as well, but SQL is too low level for "data scientists" who are used to copy-pasting scripts from the internet and working with pre-baked packages.

Admittedly, there are data scientists in roles that consist of hacking together opaque scripts until something comes out. But I do think people misperceive SQL as "low level" compared to R/Python, when ironically, SQL is actually high level, in terms of a programming language.

> grokking

Hey, that brought a smile to my face this morning, thank you (I love Heinlein).

Not to step on anybody’s toes, but… I often suspect that there are lots of people who tried, and failed, to learn relational database techniques and gravitate toward schemaless solutions like Mongo just because they’re easier to understand. Maybe not everybody, but more than a few that I’ve interacted with.

They're superficially easy to understand, but end up moving the complications of concurrency, transactions, and statistical query planning up into the application. Those are much harder problems to solve correctly than just learning SQL and understanding the output of EXPLAIN.

It's a common ignorance pattern: hipsters reject complex and mature effective tools that require a learning investment (in this case a RDBMS and its fancy configuration options and SQL capabilities) because they don't know what they can do, and embrace inadequate familiar and/or shallow tools (in this case, misused fashionable and "schemaless" NoSQL databases) because they are confident that they can fill the gaps (in this case application code in familiar programming languages, to the extreme of "reinventing the wheel" with completely custom frameworks).

I was trapped in this mode of thinking for a few years when I was younger. I really regret falling for it, and try to help other people when I see them falling for it. I wish somebody had set me straight, so I'd not have spent those years searching for excuses to perpetuate my ignorance.

> hipsters reject complex and mature effective tools that require a learning investment

I think there is an element of that, but let me suggest another possibility: this is driven by modern (so-called “agile”) project management techniques. Every week or so, the programmers are called on the carpet to provide “estimates” for a series of a dozen or so vaguely-worded “tasks”, each of which does have clear business value but is pretty open-ended about acceptance criteria. These estimates can only be on the order of a few hours: if the estimate goes higher, the project manager insists that you “break it down” until the individual subtasks can be estimated on the order of a few hours. This “breaking down” process, likewise, is expected to only take an hour or so - if it takes longer, you’re going to be answering for it next review period.

So you have a “task” that looks like it could use a relational database, but you don’t really know relational databases that well. You’re not an arrogant narcissist, so you recognize that something that takes other people weeks or months to learn effectively is probably going to take you weeks or months to learn effectively, as well. You _do_, however, know how to stuff documents into a document repo, and you can estimate that, and the estimate is within the acceptable four-hour maximum that the project management group that is, for some reason, now running the show, permits.

Even if you’re supremely dedicated and spend your off time studying (assuming you don’t have a family and _have_ any off time), you still have to come up with something RIGHT NOW so maybe after a few weeks of evening study you might have been able to produce something better, but the idiotic “sprint” ends on Friday, so you stuff it into a key-value store, mark the task complete, and spend orders of magnitude more time working around the problems that were introduced by myopic, shallow-thinking project managers than you would have spent actually putting together a decent solution.

So there’s that.

From what I've seen and personally experienced, the sort of inexperience-masking hipsterism he's describing starts earlier. In university or highschool, years before Agile or other management styles become relevant to the individual.

Even a superficial difficulty can make a big difference for some people. If you start out with a NoSQL database because it seems simpler, and then with each problem you run into, it seems simpler to solve it with NoSQL because that's what you're familiar with.

Often the best tool for a job isn't the easiest to learn. Presumably a person who calls themself a "Data Scientist"—or even just a developer—should be able to figure out SQL or take a class. Applied Knowledge is after all what we get paid for.

If you are a professional developer and can't figure out SQL then you don't deserve that 6 figure salary.

Many data scientists are STEM PhDs with research under their belts. This shows they can figure out complex, technical challenges and then translate them into written, persuasive materials. I'd be surprised if most of these highly learned, gritty individuals (half of grad students develop mental health issues) could not learn SQL, even if it is tricky/ugly/less user friendly.

I use SQL daily and while it's not ideal, it provides a very reliable foundation upon which R and STATA scripts build.

The author of this essay is beating a dead horse. "NoSQL" (I really dislike the term) has proven its value over the years and deserves a seat at the table. People will always misuse technology or implement it poorly, but I don't think it warrants yet another oversimplified "SQL vs NoSQL" rant.

Genuinely curious since I have only seen bad things said about NoSQL in most places over the years, what benefit does it provide other than scaling?

It's unlikely that a company needs to invest in data scientists, or even to think very hard about their data organization, until the scale of their data is already pushing the bounds of what most RDBMSs can handle. NoSQL is nice if you are planning for scale, since it will seamlessly get big without much thought or any significant changes to the performance. There are no "gotchas" that will cause very long-running queries or that will lock a huge number of rows, so performance is very good and most importantly very stable. This is largely due to the fact that you have to think more deeply about your data access plan up front, since almost any query is a table scan (which, ideally, never happens). I think that this is secretly a benefit in that it forces people not to perform ad-hoc queries on databases, and to think of their databases in terms of the APIs that they have built over their databases, because the databases are not going to efficiently support any other sort of access than the access plans included to support those APIs. I would lean in favor of a NoSQL database to back a production networked service because the upsides are helpful in this case (stable performance, easy to scale) and the downsides are not significant (have to plan your APIs up front -> going to do that already, no ad-hoc queries -> not going to run ad-hoc queries on a production database anyway).

> It's unlikely that a company needs to invest in data scientists or even thinking very hard about their data organization until the scale of their data is already pushing the bounds of what most RDBMSs can handle.

I work in an engineering data science consulting group for a major UK PLC. Almost all of our clients have mainly SQL based data sources, and none of them really require cloud scale. In fact I don't think I have come across a client whose data I can't handle in SQLite on my laptop.

However, the problems we solve do require data scientists - mainly STEM MScs with a smattering of PhDs who are capable of combining statistics, domain knowledge and the ability to work within a variety of organisations and present.

Seconding that. Several 'AI' startups I worked with do not require a massive amount of data to create MVPs/first releases.

They do, however, invest early in sufficient data science functions to have a good liaison between the researchers and data acquisition.

> It's unlikely that a company needs to invest in data scientists or even thinking very hard about their data organization until the scale of their data is already pushing the bounds of what most RDBMSs can handle.

I disagree. Data science is largely about applying and interpreting data, and the value of that isn't really dependent on having data that is problematic for RDBMSs due to scale.

If you think RDBMSs can't horizontally scale you have not kept up with the developments :) Look into https://docs.citusdata.com/en/v8.3/get_started/what_is_citus...

Immutable/stream databases are a fundamental innovation, even if everything else turns out to have been an unsustainable performance hack.

I never really understood how anyone expected to have a sensible debate on "thing" vs "anything that isn't thing" (I realize "NoSQL" is somewhat better defined than that, but not by much).

Agreed. The more meaningful debate would be "relational" v. "not relational", albeit marginally (swapping "not relational" for something more specific like "document" or "key-value" would be even more meaningful). There are indeed relational databases that are technically "NoSQL", like Mnesia.

well said

I've been working with SQL a lot in my job lately. For what it's worth, I'm a big fan.

It seems the central argument here is NoSQL doesn't force you into good design habits so it's overrated. I'll concede this is partially true from my perspective because much of my work involves trying to sanitize, transform, and otherwise refactor poorly structured or designed NoSQL datasets.

But I've also seen my fair share of SQL databases which are poorly designed or don't use features which are meant to benefit developers (I haven't seen a Foreign Key smartly implemented in a LONG time). It's not really fair to say NoSQL has encouraged poor design practices; from my experience it seems like data model implementation is given little effort in general.

NoSQL takes the 'training wheels' off data model implementation, whereas SQL is like keeping them on; but even with them you can still fall off the bike if you aren't careful, though it's much harder.

>I haven't seen a Foreign Key smartly implemented in a LONG time

This has also been my experience as a web developer mostly. I rarely see any applications that make use of foreign keys constraints supplied by the database server. Usually I see relations being handled by code on the application side.

Even when I build apps that are using SQL, I always implement this stuff in code instead of FKs even though I know how they work and when designing a database schema in an app like (for example) mysql workbench, I have the option to add them easily. But I always use the "ignore foreign keys" option and then implement the constraints in code. I just find it a bit more sane to have all of the logic inside the app.

I know I'm probably "doing it wrong" but would love to hear some other opinions about this from the hn crowd that does web development. I'm guessing that more "enterprisey" apps like CRM's and ERP's will probably use more of the native database stuff.

Application-level checks are prone to race conditions, especially at the default transaction isolation levels of many popular SQL databases.

For example: transaction X deletes a parent row and its child rows while a concurrent transaction Y inserts a new child node. Without a FK, this can leave an orphaned child row.

For similar reasons, SELECT before INSERT is generally not a safe replacement for a UNIQUE constraint.

You don't even need that high a user volume to see these errors in practice.

It's a shame that ORMs don't tend to encourage the use of database-level constraints, but even if I just want to build something quickly in e.g. Rails, I still add the constraints as an extra check because I'd rather get the occasional 500 than an inconsistent database.

For the most part it doesn't matter in a lot of "web development"; most web devs' education is critically lacking when it comes to databases in general.

The moment you or your customers need business intelligence and complex queries, you quickly realize how important FKs and indexes actually are. I firmly believe that you should always have FKs and indexes, referential integrity at the db level is a great thing to have and is cheap and easy to implement.

Even if your apps do not make direct use of FKs (through cascading deletes for example), they, along with other constraints, are smart to have in place for the protection they provide. If every order requires a customer, why not also enforce that rule in the database? The application still has to handle it properly, but the database makes sure it's handled.
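A sketch of what "every order requires a customer" looks like enforced at the database layer (hypothetical customers/orders schema, demoed with sqlite3, which needs an explicit pragma to enforce FKs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id))""")

conn.execute("INSERT INTO customers (id) VALUES (1)")
conn.execute("INSERT INTO orders VALUES (10, 1)")  # ok: customer 1 exists

# An order pointing at a nonexistent customer is rejected by the
# database itself, no matter which application tries to write it.
try:
    conn.execute("INSERT INTO orders VALUES (11, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print("orphan rejected:", orphan_rejected)
```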

Of course you're probably thinking that it's no big deal when you're writing the app and the database at the same time. I will sort of concede that point (although I think constraints, FKs, etc... should all be used unless later performance testing shows they must be removed in certain cases). But, data nearly always outlives the original application. When it's time to add to/update/rewrite the original application code is when you'll be ecstatic to have the business rules enforced in the db.

Thanks everybody, pretty much all great and valid points. This is all usually very much glossed over and oversimplified in most online docs and tuts in the "web development" area, which is unfortunate since it's usually all mostly about keeping and maintaining said data.

I frequently have the nagging awareness that there is a lot of SQL functionality that I'm probably missing, misusing, or completely ignorant of.

Particularly, the FK example stood out because even though FKs are usually an integral part of database design, I don't think enough importance is given to them "overall".

As @mpartel stated, since most ORMs have similar functionality built in, their usage is just basically ignored. Also a good example about the transaction stuff; I rarely see the use of foreign keys encouraged (and in your example pretty much required) in online tutorials and docs when discussing transactions.

More of a rant really, thanks again for the input guys! +1

When I started as a web dev, I simply didn't know about them. The frameworks I was using didn't really emphasize it, so I glossed over it.

Besides referential integrity, foreign keys may improve the query plan (depending on vendor).

But either way, I can't imagine any reason why you wouldn't want constraints to ensure your database is in a consistent state. There can always be bugs in your application code, manual db updates, etc.

If you have to ask, then you should use them to maintain integrity. Only devs who know what they are doing have the luxury of ignoring them, for reasons like sharding, performance, complex schema migrations, etc.

It's nice that the article mentions Codd and his relational model of data, but what it doesn't mention is how badly SQL parrots relational algebra. The language is inspired by the idea ("based on a real story"(c)), yes, but it takes a really clean and sound model and makes an unbelievable mess out of it.

SQL is just an ugly historical accident. Unfortunately, this is how it often works...

NoSQL is a different story, of course. BTW, I believe non-relational stores predate Codd's work. There were many examples of non-relational DBs in the 70s.

> SQL is just an ugly historical accident.

Maybe, but as Bjarne Stroustrup said: "There are only two kinds of languages: the ones people complain about and the ones nobody uses." SQL seems like that: pragmatic, not perfect, but everyone uses it.

Bjarne invented C++, I don't think parroting Bjarne contributes anything to the discussion.

You don't think quoting a person responsible for creating one of the most influential programming languages contributes to a discussion on programming?

I'd call C with classes a very costly mistake, that's probably set back computing by decades. But I can't deny its influence, although I would respectfully disagree that it's a good thing.

You're talking about C, right? :-)

It just goes to show how powerful the relational model is when a language as bad as SQL can still succeed enormously :)

I'd love to educate myself more on how SQL mangles rel. alg. and whether there's another purer implementation. Any links?

Rel is a DBMS based on Codd and Date's Tutorial D. https://reldb.org/c/

For a discussion on how SQL specifically falls short of the relational ideal, see this link: http://www.nocoug.org/Journal/NoCOUG_Journal_201308.pdf

Pascal and Date have been making these arguments forever: http://www.dbdebunk.com .

Yes, there were various approaches to non-relational data stores but they were not so flexible in terms of "schema", which I believe is the main strength of NoSQL.

A possible exception could be MUMPS https://en.wikipedia.org/wiki/MUMPS but I have no direct experience with this (while I used something akin to https://en.wikipedia.org/wiki/Hierarchical_database_model at the start of my career).

Pick[1] subscribes to a similar philosophy, merging the OS into a database.

[1] https://en.wikipedia.org/wiki/Pick_operating_system

Wow. We use this (Universe "flavor") at work, can't believe I'm seeing it on HN.

> how badly SQL parrots relational algebra

I’m continually amazed that we’re STILL primarily interacting with databases, programmatically, by passing SQL statements as strings into them, rather than, say, passing in a list of column names as an array, and a list of conditions as tuples, etc. ORM’s try to bridge this gap, but they do a pretty bad job of it; they end up oversimplifying so much that the result is unreasonably inefficient.

The difficulty with the approach you propose, and by contrast one of the major strengths of SQL, is composition. Passing a list of column names and a list of conditions lets you express precisely two concepts: Filtering and Projection. You could also add more: A set of tables to join together, an "aggregation" version of the same operation, etc... Going down this path, however, leads to a monolithic function that becomes progressively harder to generalize.

What relational algebra (and by extension SQL) gets "right" is that each of these operations (Projection, Filtering, Join, Aggregation, Union) are composable: They take 1 or 2 collections as input and produce a collection as output. Moreover, each operation has simple and well-defined semantics from which you can build much more complex logic.

That's not to say that Relational Algebra can't be built in to an imperative language. Scala (and by extension Spark) collections are a great example of composable operators at work. Ruby's array methods, Python comprehension syntax, and Pandas/NumPy are similar examples of simple, composable primitives that combine to be much more powerful data transformations.
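A rough illustration of that composability in plain Python (made-up rows; each step takes a collection and returns a collection, mirroring a relational operator):

```python
from collections import defaultdict

rows = [
    {"dept": "eng", "salary": 100},
    {"dept": "eng", "salary": 120},
    {"dept": "ops", "salary": 90},
]

# Selection (WHERE), then projection (SELECT): each stage consumes
# and produces a collection, so the stages compose freely.
filtered = [r for r in rows if r["salary"] > 95]
projected = [(r["dept"], r["salary"]) for r in filtered]

# Aggregation (GROUP BY dept, AVG(salary)):
groups = defaultdict(list)
for dept, salary in projected:
    groups[dept].append(salary)
averages = {dept: sum(v) / len(v) for dept, v in groups.items()}
print(averages)  # {'eng': 110.0}
```

The equivalent SQL (`SELECT dept, AVG(salary) FROM rows WHERE salary > 95 GROUP BY dept`) expresses the same pipeline declaratively.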

Apart from RA-based language primitives, there's also compiler support that allows you to use SQL directly, but avoid passing strings around at runtime. .NET's LINQ is a great example. I'll also pitch one of my own projects, DBToaster (https://dbtoaster.github.io/), which compiles view maintenance queries down to a C++ or Scala class.

In short, I agree that passing strings around leaves performance on the floor and leads to escaping and code injection nightmares. But SQL is the culmination of literally decades of use-based design, and any effort to replace it needs to take care to understand what it does well and why (like the efforts I reference above)

Amazed until you realize the database must be able to accept SQL statements from many many different languages using many types of network drivers. Rather than implementing this (correctly) everywhere, centralizing the "smarts" in the database so that the languages and drivers just pass a string becomes a much less amazing design decision.

I'm sorry but ROFL. ORMs trade away too much query performance for programmer ease, imo. In all my experience it makes a lot more sense to get closer to the data with SQL than to use something like an ORM, in the same way that C is faster than Python: SQL is closer to the metal like C, and an ORM is more like Python, quicker to prototype in but not fast for this use-case.

There's tooling that bridges the gap. An example I've been using recently: https://github.com/andywer/postguard I understand Java has a lot of these as well.

This is one of the concerns where embedding a language in another is sort of necessary but is a bit too much to have the compiler handle it by default unless DSL's are already a first class construct in the language.

Your upstream tooling is going to have to compile something into that data structure you propose, and then you're going to want to do ad hoc queries easily as a human, and end up inventing something worse.

Side note with postgres you can "prepare" a statement, compile it once, give it a name, and then execute that query without passing the full query every time.

The people that bemoan SQL are like the people that bemoan staff notation. Every [music|data] student thinks they can do better when they first learn it instead of embracing the thought that's gone into it over many years.

Interesting analogy; the problem is staff notation has been used by basically every major composer for hundreds of years. SQL, while important, is far less universal and far more complex. In the end, staff notation is more like writing than it is like SQL. Writing and staff notation both have a limited number of characters with an insane number of possible combinations. With SQL, the main challenge lies in understanding the underlying data structures, not the declarative symbols of SQL itself.

I don't agree with the vilification of NoSQL, but I do agree that SQL is a great query language.

That's partly why I wanted to create a tool to query various databases (NoSQL ones or files too) with SQL. We're still in an early stage with OctoSQL regarding the variety of databases, but take a look if that sound appreciable to you: https://github.com/cube2222/octosql/

What sets OctoSQL apart from the existing options such as Apache Drill (even Spark SQL for that matter) or future projects such as PartiQL?

https://partiql.org/ https://drill.apache.org/

We're aiming to have very ergonomic stateful stream processing with only SQL and we're working on it currently. That's basically what's meant to set us apart.

So tapping into the change-logs of the underlying data-sources and providing a stream processing layer that's expressible in a stream SQL dialect?

btw, not being critical of your project, just trying to understand it.

Mainly thinking of explicitly stream oriented data sources like kafka, but yeah, change logs are really solved with an analogous abstraction.

There's a great paper on that: "One SQL to Rule Them All", check it out.

We also want to scale well from single-computer one-off data exploration queries, to full blown clustered long-term stateful stream processing.

The point is to provide a well thought out SQL interface to as many data sources as possible, and like drill does, push down as much computation as possible.

We actually learned about drill only after creating OctoSQL, but that's another story. (We're definitely less mature currently and support fewer datasources)

Ha! I was going to link that to you once I understood what you meant.


I believe there is a similar effort going on in the Flink project. They're slowly merging in work from the Blink project to provide a unified SQL paradigm (batch + stream).

Yup, Akidau's book is a great resource too.

We're basically aiming for a middle ground between Flink, Presto and Drill.

Coming from the boring enterprise world, where we didn't really get swept along in the NoSQL hype, this doesn't seem like it should remotely be a surprise to anyone.

"Hey, turns out there's lots of great ways to use SQL!" Yeah, we know, they've been at the core of our business for the last 20 years at least.

Lots of enterprise shops started out as SQL databases to keep track of all the data, and web apps grew up organically around them.

SQL is a functional programming language. No other imperative programming model will be cleaner or clearer in expressing intent. But, you have to understand SQL to begin with.

Did you mean to say declarative? That’s what makes it clear in intent isn’t it? That it’s declarative. Whether or not SQL is also functional is orthogonal to that isn’t it?

further, afaik there's no assignment nor iteration in SQL. it's not a programming language at all. It's a.. query language.

I'd consider INSERTs and UPDATEs to technically be "assignment", even if they're very different from how other languages do it.

Some SQL dialects do support both traditional variable assignment and iteration for those cases where an iterative/imperative approach makes more sense than trying to shoehorn the problem into something set-based / declarative. Some limit them to stored procedures (e.g. Postgres, and AFAICT Db2), while others allow them pretty much anywhere (e.g. SQL Server / T-SQL).

> further, afaik there's no assignment nor iteration in SQL

That was until CTEs were introduced in SQL:1999
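For example, a recursive CTE expresses iteration declaratively. A small illustration (run here through Python's sqlite3 so it's self-contained; the table-free counter is just a toy):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CTE "loops" from 1 to 5 with no table and no explicit
# loop construct: each step selects from its own previous output.
rows = conn.execute("""
    WITH RECURSIVE counter(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM counter WHERE n < 5
    )
    SELECT n FROM counter
""").fetchall()
print([n for (n,) in rows])  # [1, 2, 3, 4, 5]
```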

> it's not a programming language at all. It's a.. query language.

The two are not mutually exclusive. SQL is used to tell a computer what to do, and it is very powerful at it: https://www.youtube.com/watch?v=wTPGW1PNy_Y

I tend to lump it in as a "logic programming" language, along with Prolog and Datalog.

While there's no formal definition of functional programming (that I know of at least), SQL doesn't support higher order functions, which are fundamental to all functional programming languages. That said, it's definitely declarative.

One feature I'd expect a functional programming language to have is functions as 'first class'. (e.g. can pass as input to higher order functions).

But e.g. PostgreSQL doesn't have functions as a datatype. https://www.postgresql.org/docs/current/datatype.html

So, uh, SQL isn't 'functional', right?

Depends what you mean by "programming".

But virtually all query languages are declarative; e.g. GraphQL.

Give the data scientists SQL and relational algebra.

But please don't give them stored procedures and triggers.

Data scientist who was given stored procedures and triggers. Can confirm dangerous activities took place.

I get triggers, but what's wrong with procs?

Not who you are replying to, but I believe the issue is version-control unfriendliness. At worst, they'll input everything to the DB directly, at best you'll have migrations, where you'll have to track down what the current piece of code is from a set of files.

You can, and I think you should, place your data and stored procedures in separate schemas. This way, migrations are only required for the data, and stored procedures can be deployed and version-controlled just like any other code.

What do you mean about version-control unfriendliness? Stored procedures, along with all of the other schema objects such as tables, views, UDFs, UDTFs, etc., can be version controlled.

You can do it, but it is unfriendly to that model. If you track migrations, you're tracking the changes on top of a change system, so you'll often have to track down which file has the latest version of something manually instead of relying on the VC layer.

If you don't track migrations, and track only the latest CREATE statement, you can't deploy to a real system with that.

If you track both migrations and a final state, you'll end up in an awkward situation, because you'll likely have to do your migrations first and rely on automation to render your final state (in how many files? under what criteria?) or manually copy-paste it without making mistakes while coming up with your own structure. Or worse, writing on the final state set and then carefully copying changes back to migrations.

It's certainly not the worst thing in the world, but it is very unfriendly compared to the usual way of development under version control.

Okay... I see what you mean.

We've been experimenting with versioning schemas: deploying new versions of schemas alongside existing ones. There is never any question about what version of a stored procedure is in that schema. Same goes for all of the other schema objects. There is no schema migration tool needed. Of course I have to mention this is in a data warehouse scenario and not an operational data store.

You CAN track the latest "CREATE" statement and deploy to a real system. Conceptually you only need to do a "diff" between your current "CREATE" and the one in the system where you are deploying. https://github.com/djrobstep/migra

Also something similar which uses apgdiff https://github.com/subzerocloud/subzero-cli

You bring up an excellent point. I had forgotten about auto-migrators because they encourage not being as conscious about the impact of potentially major changes to data since reading a tool's output is less cognitively engaging compared to writing something that works.

But stored procedures are another matter; this could work in a fairly reliable way. Have you used this in any project working with other people? How does it work out?

(genuine question) What are the best alternatives to triggers? And what makes them a bad idea?

I'm pretty much with you on stored procedures.

I am sure there are good use cases for triggers but I have seen quite a few databases that used triggers a lot and it almost always felt like the equivalent of spaghetti code. Things are happening and it takes forever to figure out why they are happening.

Depends on your perspective. Yes, if you are an application developer with weaker database skills who struggles to follow the data integrity logic in the database. No, if you are a skilled database guy with weaker knowledge of the source code of all the applications (could be multiple) sharing the database. Things are happening and it takes forever poring through each app's code to find its data integrity logic to figure out why they are happening.

By centralizing data integrity logic in the database, you know where to look and that all apps using the database will abide by it.

I agree with you but in the cases I saw I felt that the triggers were used to fix problems in the code more than being part of a consistent data strategy. I admire well designed databases but unfortunately there aren’t too many of them out there.

I think part of the problem is that there is still this huge chasm between good coding skills and good database skills. It’s hard to have both.

Triggers definitely have their place and can be very useful. But they can pretty quickly grow out of control leading to lots of side effects making it difficult to maintain and possibly leading to performance issues. I think twice before adding a trigger to consider if it is really necessary because using triggers as your first tool to solve problems can lead to a lot of complexity.

I like the author's point about this:

> The trigger function itself was also written in the “schema.sql” file which meant that everything was clear and documented along with the other relevant schema information, thus easier to track and manage.

In my experience, in most cases the application doing the insert/update/delete is the best place to handle most side effects. That may be an on-demand API, a scheduled/queued job, etc. I've found triggers to be helpful and simple for audit and history tracking. There are definitely plenty of cases where triggers are useful and it probably depends a lot on your architecture.

We're talking about "data scientists" for whom software development is not their primary focus. SPs and triggers do not share SQL's advantage of being declarative: they are imperative, and non-developers should spend as little time as possible in imperative domains.

I'm not as anti-trigger as I am SP. Though I stay away from triggers in any case as the rest of the world is fairly negative on them.

For which usecase(s)? Triggers seem to be used for many different things...

I've worked extensively with SQL and relational data modeling. It's been very useful for accomplishing my work. I haven't come across better tools and so I haven't adopted alternatives. If the time ever comes where there is truly a superior tool set to the one I am currently using, I will gladly stop using antiquated technology. Newness isn't sufficient to sway me. No dogma here. Just pragmatism.

Data structures still matter, and very much so! If you run in the cloud, data structures have to be very efficient! Data should be normalized. MySQL has built-in hash table support through adaptive hash indexes.

I think one should also not over complicate the data layer.

Boring tech. I like the safety of the C and D in SQL's ACID. http://boringtechnology.club/

> Data should be normalized.

This is a rule of thumb, only, and depends on your definition of efficiency and your queries. If you normalize everything, especially with some analytics queries, you will quickly find silly barriers to query times caused by all of the required joins.

Sure, but now we're getting into higher level system design. The schema used to run the operational system may/will be different from the one used for reporting and analytics. This is what data warehouses were born out of with the flattening of data, star/snowflake schemas, dimensions, etc...

> On the other hand, if you check Postgres’ configuration file, most of the parameters are straightforward and tradeoffs not so hard to spot.

The article also refers to MongoDB as hard to configure. It may be a matter of taste, but I found MongoDB's documentation way better than Postgres'. They have things like this:

https://docs.mongodb.com/manual/administration/production-ch... https://docs.mongodb.com/manual/administration/analyzing-mon...

which I can easily follow and apply, and they are actionable: set noatime in fstab, use XFS, raise the max file handle limit, etc.

For Postgres https://www.postgresql.org/docs/12/index.html I cannot easily find something similar to the MongoDB one.

Good article, but it and a lot of the comments here seem to be conflating SQL as a language and Postgres (or other RDBMS implementations).

For the record, I like both these things and they often go together well. But once your data gets too big for Postgres, you don't have to immediately jump to NoSQL: modern distributed SQL tools like Presto are quite good and can let you continue to use the same SQL interaction patterns, even at the terabyte or petabyte scale.

You have to be a little more aware of what's going on under the hood, so you don't write something pathological, but it's quite a powerful approach.

I am even using Presto for a lot of what would normally be considered ETL: using SQL to select from a "dirty" table and inserting into a "clean" table on a schedule.
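A toy version of that pattern (sqlite3 standing in for Presto here; the dirty/clean tables and the normalization rules are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dirty (email TEXT)")
conn.executemany(
    "INSERT INTO dirty VALUES (?)",
    [("  A@Example.com ",), (None,), ("b@example.com",)],
)
conn.execute("CREATE TABLE clean (email TEXT)")

# The whole ETL step is one INSERT ... SELECT: trim whitespace,
# lowercase, and drop NULLs on the way from dirty to clean.
conn.execute("""
    INSERT INTO clean (email)
    SELECT LOWER(TRIM(email)) FROM dirty WHERE email IS NOT NULL
""")
cleaned = conn.execute("SELECT email FROM clean ORDER BY email").fetchall()
print(cleaned)  # [('a@example.com',), ('b@example.com',)]
```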

And presto's rich functionality for complex data types (maps and arrays) as well as window functions makes some pretty challenging things possible.

Start with Postgres. Don't start with SQLite. SQLite is a file format, not a database; it scales atrociously (I've seen simple queries run for 10s of seconds with 100MB of data), it basically doesn't work in concurrent update scenarios, and the more data you put into it, the more pain you'll have migrating to Postgres.

Use SQLite if you want a file format where you don't want to rewrite the whole file for every update, or if you're in a situation or environment where it's not feasible to have an actual database server.

Sure if your use case is an analysis of interconnected entities and you either work alone or have someone to help you maintain a shared server.

Otherwise please don’t. I earn my living (amongst other things) due to organisations that went with this as blanket approach.

Use the right tool for the job. SQLite is fantastic for small to medium size datasets and shines in the immutable case. Plus its maintenance and sharing cost is close to zero (something you will cherish once Docker comes into play, or if you want to learn or test some SQL without going through the pain of setting up local permissions for your schemas).

I think SQLite is terrific. I don't know what you mean by saying it is not a database. Its SQL querying capabilities are definitely database-worthy.

> SQLite is a file format

Well, you could say the same about Postgres...

They are both databases, and the lack of scalability or (not) being able to run as a service does not change that simple (and useful) fact.

As someone who has to use Elasticsearch as an only data store, yes.

Can you elaborate on which issue(s) you encountered? We are considering Elasticsearch at the moment for document storage and search, versus postgres.

FYI, you might want to check out ZomboDB[0], which integrates Postgres and ElasticSearch. It's open source[1] and, fwiw, the developer was helpful when I pinged him with some questions a while back (and is available for consulting services).

From the project's github page[1]:

ZomboDB brings powerful text-search and analytics features to Postgres by using Elasticsearch as an index type. Its comprehensive query language and SQL functions enable new and creative ways to query your relational data.

From a technical perspective, ZomboDB is a 100% native Postgres extension that implements Postgres' Index Access Method API. As a native Postgres index type, ZomboDB allows you to CREATE INDEX ... USING zombodb on your existing Postgres tables. At that point, ZomboDB takes over and fully manages the remote Elasticsearch index and guarantees transactionally-correct text-search query results.

[0] https://www.zombodb.com/

[1] https://github.com/zombodb/zombodb

I typically keep search indexing separate from document storage and delivery. That way I can design each independently to do what they do best. Something like postgres can do both, but I would still logically separate the problems so I can move to better solutions when they arise.

Come on now, even ES has an SQL interface.

>use Elasticsearch

Found your problem.

One point got me curious:

>As a rule of thumb, vertical scalability is in most cases more economical than the horizontal one.

Isn't the opposite the reason why we started to scale horizontally in the first place?

No, it's because projects with huge amounts of data were growing beyond the limits of what you can reasonably do on one machine. Machines were less capable then (smaller disks, less memory), and those limits are a lot higher now. If your data is small enough to fit on one (very beefy) machine, then it's probably still cheaper to pay for that high-end machine vs. distributing to a bunch of less capable ones.

There are exceptions - distributing the data can be really helpful if you need to do a lot of bulk I/O (ETL jobs, analytical queries, etc.), but it comes at the cost of making "transaction" use-cases difficult and expensive. Using a scaled-up OLTP[1] database for user interaction and a scaled-out OLAP[2] database for analytics and ETL jobs is a common pattern.

[1] https://en.wikipedia.org/wiki/Online_transaction_processing

[2] https://en.wikipedia.org/wiki/Online_analytical_processing

Big machines got cheaper.

Amazon will rent you a machine with a TB of RAM in it for an hour for the price of a fancy coffee.

Horizontal scaling has enormous complexity costs. It's worth it if you are genuinely web scale (Facebook, Twitter etc) but very few projects are.

The more pertinent reason to scale horizontally rather than vertically is that horizontal scaling without downtime is easier (if you've nailed down / automated node deployment, which is a big "if", albeit one made easier if you're using any of the big PaaS/IaaS providers):

With horizontal scaling:

1. Spin up the new node(s)

2. ???

3. Profit

With vertical scaling:

1. Spin down the node¹ (hopefully you've got more than one!)

2. Resize it

3. Spin the node back up

4. GOTO 1 unless all nodes are scaled up

5. ???

6. Profit

OR (somewhat simpler, but at this point you might as well just horizontally scale):

1. Spin up the replacement node(s)

2. Spin down the old node(s)

3. ???

4. Profit

That is: vertical scaling has more steps, even when done in a way similar to horizontal scaling. Sometimes it's necessary to scale vertically instead of horizontally, though (but in that case you might as well just horizontally scale with bigger nodes).


¹ It's theoretically possible to add and remove resources to/from a running system; some mainframes offer it as a feature (Multics programmers in particular were known to yank hardware out of production systems during off-hours and plug 'em into dev systems for development/testing, then plug 'em back into production for on-hours, all without bringing down the system), and Linux supports hot-adding and hot-removing CPU and RAM provided the underlying hardware supports it (Windows supports hot-adding as of Windows Server 2008 Datacenter Edition, but not hot-removing, last I checked). However, I'm not aware of any PaaS/IaaS providers offering compute instances with this capability; EC2 instances definitely don't support it, last I checked (switching from one instance type to another absolutely requires stopping and restarting EBS-backed instances, and instance-storage instances have to be outright imaged and then reimaged onto a new instance of the desired type).

Querying data in "Cypher" is so much easier. I have 20 years as a database developer, and Cypher is way better. I can live code 50 complex queries in Cypher before I get one done in SQL.

What is Cypher?

It’s best not to take criticism seriously from people who have no formal CS education and can’t even define the premises of relational algebra. Ah well.

Do you have any particular critiques of the author's content? Otherwise, this feels like gate-keeping, since you provided no specific critiques of any content.

Not the author, I’m referring to the people that drive the author to have to make a post like this at all.

Alright here's a relevant question I've been having in terms of this. Let's say I have the code to gather / scrape / load some stats into postgres, but then want to run projections on them.

For example, if I'm trying to predict the next day's stats by using stats in the past by simply taking the average, how many days in the past should I look at results and take the average of? Is the last 3 day average the best? 5 days? 8? 10?

There are clearly a couple ways to do it. One is by getting a data frame of all the stats for all the objects, write the python logic to loop through the days, get back the stats from the past X days not including the current day, taking the average and then storing that back to postgres in a projections table. A function like set_projections(5) where 5 is the number of days in the past I'm taking the average of.

Second way to do this is write that function as a plpgsql function where uses subqueries to find the past X day stats for the players and then creates or updates the projections table all in sql so we can run `select set_projections(5)` and that'll do it itself.

So the question becomes, which one is "best"? I have to imagine it's mostly a case by case basis. Since it's only me here, I've been doing it in postgres alone since it can be done in one query (with multiple sub queries, yes), but that's it. With python, it'd involve many more steps. On the other hand, the sql looks huge and then I've been running into the issue of: do I split some of the sub queries into sub functions, since that's what I'd be doing in python? If there were more people involved, would it be bad to have larger cases like that in postgres, since we wouldn't know the skill of the others? Mostly they'd be coders and could write the projections in languages they'd want.

Another example of this tradeoff is how I should interact with the database. I have a Rails background, and ActiveRecord is an incredibly good ORM, whereas SQLAlchemy hasn't done it for me. In either case, there's a ton of overhead in getting the connections running with the correct class variables. So instead, I kind of created my own ORM / interface where I can quickly write queries on my own and use them in the code. This is especially easy since most of the queries can be written as Postgres functions, so the query strings themselves are incredibly tiny!

What I've learned from this project is that SQL is very, very powerful in what it does. I've shifted more toward using it for actions than I would have in the past, and pretty much all my thinking goes into writing as little code as possible.

Anyone make it through reading this and have comments about what I'm doing and what they like to do?

It sounds like you're calculating a running average. Have you looked into window functions? I think they do exactly what you need. Something roughly like:

  SELECT AVG(stats.val) OVER (
    ORDER BY stats.day
    ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
  ) AS avg_prev_3_days
  FROM stats
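In case it's useful, here's that rolling-average idea as a runnable end-to-end sketch via Python's sqlite3 module (SQLite ≥ 3.25 accepts the same OVER clause as Postgres). The table, columns, and `set_projections` name just mirror the hypothetical ones from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stats (player_id INTEGER, day INTEGER, val REAL);
    CREATE TABLE projections (player_id INTEGER, day INTEGER, projected REAL);
""")
# Toy data: player 1's stat on day d is simply d.
conn.executemany("INSERT INTO stats VALUES (?, ?, ?)",
                 [(1, d, float(d)) for d in range(1, 11)])

def set_projections(conn, lookback):
    # Average of the previous `lookback` days, excluding the current day,
    # computed per player in one pass. (lookback is an int we control, so
    # the f-string interpolation is safe enough for a sketch.)
    conn.execute("DELETE FROM projections")
    conn.execute(f"""
        INSERT INTO projections
        SELECT player_id, day,
               AVG(val) OVER (
                   PARTITION BY player_id
                   ORDER BY day
                   ROWS BETWEEN {lookback} PRECEDING AND 1 PRECEDING
               )
        FROM stats
    """)

set_projections(conn, 3)
row = conn.execute(
    "SELECT projected FROM projections WHERE player_id = 1 AND day = 10"
).fetchone()
print(row[0])  # avg of days 7, 8, 9 -> 8.0
```

Whether this belongs in a plpgsql function or in application code is the tradeoff being discussed; the query itself is the same either way.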
Regarding this point:

What I've learned from this project is that SQL is very, very powerful in what it does. I've shifted more toward using it for actions than I would have in the past, and pretty much all my thinking goes into writing as little code as possible.

It's an interesting tradeoff. I've been on the other side, in an environment that was super SQL heavy (also Oracle PL/SQL) where we had to deal with these huge complex queries, and it was incredibly painful, for various reasons. One, SQL by itself isn't that composable in the way code is[1], so there tends to be a lot of copy/paste/duplication. Two, giant queries can be hard to understand/debug. Three, SQL can be hard to test. I mean, you can set up integration tests against a database (which can be time consuming), but how do you unit test a query?
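On the unit-testing point specifically, one workable pattern (when the query is portable enough) is running it against a throwaway in-memory database with hand-built fixture rows. A sketch with sqlite3; the table and query are invented for the example, and a heavier but more faithful setup would use a real Postgres in a container:

```python
import sqlite3

# The query under test (table and columns are invented for the example).
DAILY_REVENUE = """
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    WHERE day = ?
    GROUP BY customer_id
    ORDER BY customer_id
"""

def make_fixture_db():
    # Throwaway in-memory database with hand-built rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, day TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
        (1, "2019-11-08", 10.0),
        (1, "2019-11-08", 5.0),
        (2, "2019-11-07", 99.0),  # different day: must be excluded
    ])
    return conn

def test_sums_only_the_requested_day():
    conn = make_fixture_db()
    rows = conn.execute(DAILY_REVENUE, ("2019-11-08",)).fetchall()
    assert rows == [(1, 15.0)]

test_sums_only_the_requested_day()
print("ok")
```

It doesn't solve the problem for queries that lean on vendor-specific features, but it keeps the fast feedback loop for the plain-SQL parts.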

On the other extreme, I've seen cases where something separately queries two tables and does a join operation in code, which would be much easier/safer just by doing a sql join. So there's a tradeoff. There's a hard to define sweet spot somewhere in the middle, IMHO, where you take advantage of relational semantics/joins, and utilize various sql features like group bys and whatnot, but where you haven't shoved all of your logic into the database layer.

[1] You can create CTEs (with foo as....) in postgres, but they may have downsides wrt performance (at least, last time I'd read up on it they posed an optimization boundary). You can also create views, but similar issues may arise.

Re the CTE optimization boundary comment: this changed in Postgres 12! By default, they are no longer a boundary. You can change the behaviour on a per-query level, too. As always, depesz has an article going into the details: https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...

In general it's been my experience that if you try to do stuff SQL could do re: reporting or heavy data work in Ruby or Python or some other scripting language in a web application context, you're in for one or both of "why is this thing so damn slow?" and "why is this report crashing the application? (answer: it's eating all the memory)"

I've also seen highly user-visible performance issues because some developer either didn't realize how slow performing a series of queries in a loop would be (didn't understand the network overhead of each request) or didn't know how to condense what they needed to one or two requests. High correlation of this with the ORM-dependent folks.
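A toy illustration of that loop-of-queries problem (sqlite3, so there's no real network here; the point is that the second form is one round trip instead of N):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(100)])
wanted = list(range(100))

# Anti-pattern: one query per id. Harmless locally, but over a network
# each .execute() is a full round trip, so this pays the latency 100 times.
names_loop = [
    conn.execute("SELECT name FROM users WHERE id = ?", (i,)).fetchone()[0]
    for i in wanted
]

# Single query: push the filtering into the database.
placeholders = ",".join("?" * len(wanted))
names_once = [row[0] for row in conn.execute(
    f"SELECT name FROM users WHERE id IN ({placeholders}) ORDER BY id",
    wanted,
)]

assert names_loop == names_once
print(len(names_once))  # 100
```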

So far as averaging a few days of stats my gut feeling is "this can just be a view with one or more medium-complexity selects behind it" but maybe there's some reason it can't.

> So instead, I kind of created my own ORM / interface where I can quickly write queries on my own and use them in the code. This is especially easy since most of the queries can be written as Postgres functions, so the query strings themselves are incredibly tiny!

I haven't been in non-Rails Ruby land in quite a while but I remember liking Sequel quite a bit. Very much The Right Amount of abstraction (i.e. barely any) over the DB.

What is "best" probably depends on your current metric for "best". Developer time spent? (Is developer and DBA the same person? How familiar are you with SQL?) Processing time? Amount of data transferred? Readable code? Maintainability? Versioning?

I've also shifted more logic to Postgres recently and keep the queries in the code trivial. It's because I like that SQL is declarative and there's no intermediate state to mess up (the whole query can be processed within a transaction).

As for readability (refactoring), you have several options in Postgres. Going the PL/pgSQL route probably means you're gravitating back towards procedural code; it has its uses, but I try to avoid it whenever I reasonably can. Try `LANGUAGE SQL` functions [1] instead: they can be more readable but likely less efficient (like PL/pgSQL, they're a barrier for the query planner). I've been mostly happy using nested views (and materialized views) lately. But whenever you can, use CTEs to structure your queries.

YMMV, just providing ideas to think about.

[1] https://www.postgresql.org/docs/12/xfunc-sql.html

It sounds like you should research window functions: https://www.postgresql.org/docs/12/tutorial-window.html

Sensitivity analysis: make your lookback period a variable itself, and then analyze/optimize your predictions over that variable.

We use SAs all the time in my work. One example: we only had 2018 data in our SA, even though our full analysis included 2019 data; this let us see whether trends changed year to year. Another example: we made 5 lookback variables (90-day, 180-day, 365-day, etc.), then analyzed which of these was most relevant to our analysis.
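A stripped-down sketch of that kind of lookback SA in plain Python (the series is invented; in practice it would come out of the database):

```python
# Toy sensitivity analysis: which lookback window best predicts the next
# value of a series? The series has a level shift partway through, which
# penalizes long windows.
series = [1, 2, 3, 4, 5, 6, 7, 8, 20, 21, 22, 23, 24]

def mean_abs_error(lookback):
    # Predict each day as the average of the previous `lookback` days,
    # then score the prediction against what actually happened.
    errors = []
    for day in range(lookback, len(series)):
        prediction = sum(series[day - lookback:day]) / lookback
        errors.append(abs(prediction - series[day]))
    return sum(errors) / len(errors)

# Treat the lookback itself as the variable under analysis.
scores = {lb: mean_abs_error(lb) for lb in (3, 5, 8)}
best = min(scores, key=scores.get)
print(best, round(scores[best], 2))  # -> 3 4.2
```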

I might be under-informed but "data science" seems to involve a lot of vague BS. Computing has always centered on data.

"Data practitioners", "data practice", "data management" just seems like weird rebranding of stuff that businesses have been doing since the 1960s. Partly what the article seems to be relating regarding SQL.

What is "data science" besides computing, data storage of some sort, and analyzing the data? Because it's "BIG" now? Because now we "extract knowledge and insights" from data, since apparently giant databases were amassed for no reason in the past? Because now we "combine multidisciplinary fields like statistics", since apparently nobody had thought to run statistics on their data before? Because "AI"?

Perhaps a more useful characterization would be that it's what happens when you take work that would traditionally belong to statisticians and actuaries with only limited programming and IT skills, and give it to someone who's a more of a jack-of-all-trades programmer/statistician.

I don't think that this leads to changes in the traditional ways of handling IT because of any fundamental change in a data analyst's needs. It's more that, historically they were outside the formal tech sphere, and therefore lusers who had to make do with whatever resources they were able to squeeze out of IT. Now, armed with better technical chops, they're able to make a stronger case for their needs, which means that more is being done to meet those needs.

Which is not to say that your skepticism is unwarranted. "Data science" is still a catch-all term for a new and ill-defined way of working, and things are still pretty wild west. Anyone who tells you they're sure they know what the best practices are is probably trying to sell you their product, which happens to be the best practice. Not many people have the benefit of 20 years of hard-won experience from which to draw a set of really strong guidelines.

The big difference is that historically you could run pretty much all analysis on a single computer (if not in Excel - I’ve seen and built many actuarial models in Excel myself) but consumer internet scale data (both in terms of the number of rows and the number of features available) is too big to model without distributed computing and more programming work on the part of the analyst.

Moreover it’s an iterative process where you start with hypotheses on what features will be useful then build a model and might need to pull more data. For example you could build a recommendation engine just using the set of products purchased by customers but then might explore adding user agent features like devices or might add data about the products like category or price information.

An analyst (“data scientist”) who can pull and interact with that data is much more effective than one who needs to wait for someone else to pull the data for them. (Most actuaries I know are weak programmers outside of R and wouldn't know where to begin using distributed computing or getting a TensorFlow gridsearch running on a cluster of cloud GPUs :)

I actually think that analytics clusters is one of those wild west things. The statistician in me is suspicious of a lot of the Big Data movement, for various reasons that could largely be summarized as, "I'm pretty sure these ways of working are motivated less by any true analytics need and more by the needs of cloud services providers to sell cloud services."

I like some of these technologies for taking care of the dragnet data collection and engineering, but, when I'm actually doing a data analysis, it's rare I want to run it on a Spark cluster; I'd much rather run a Spark job to collect the information I need and sample or digest it down to a size that I can hack on locally in R or Pandas. Yeah, there will be some sampling error, but the potential dollar cost associated with that sampling error is much lower than the cost to eliminate that sampling error. And it's basically zero compared to the elephant in the room: the sampling bias I'm taking on by using big data in the first place. "Big data" is, from a stats perspective, just the word for "census data" that people like to use in California.

I totally get your skepticism but the bottom line is that throwing computing power at a problem generally leads to a much better solution even if only because you can do a much more extensive grid-search. This doesn’t have to be using some complex cluster like you could just have n copies of the same VM that each run experiments with parameters pulled from a Google sheet until all the experiments are done.

Sure, the cloud providers sell computing power but it’s a race to the bottom and makes much more sense than buying hardware for these kinds of bursty analytics workloads.

I don’t think “Big data” from a stats perspective is analogous to census data - in some cases yes but for applications like recommendation engines you lose a lot of valuable signal by sampling.

Given that GCP will happily supply VMs that can be resized to dozens of CPUs and hundreds of GB of RAM and back, billed by the minute, sampling bias isn't even a necessity for quite a lot of things a laptop wouldn't be able to handle. I used to write Spark jobs for those slightly-too-big data problems, but Pandas and Dask are quite sufficient for lots of things, without all the headache that distributed computing entails. Plus, data people have no need to store any potentially sensitive data on their personal machines, that's another headache less. It's not going to work well for petabyte-scale stuff, though. I guess for those kinda things and for periodic bespoke ETL/ELT jobs, Spark is still useful.

The ability to re-size GCP VMs totally blew me away when I first discovered it. Just power off the machine, drag the RAM slider way up and turn it back on. SOOO much easier than re-creating the VM on AWS.

I also way prefer to just crank up the RAM on a single instance and use Pandas/Dask instead of dealing with distributed computing headaches :)

I'm pretty suspicious of the idea that most companies have data that is too big for a non-distributed environment.

I do think that most companies log lots of data in different places without much thought or structure, and as a result need a "data scientist", who spends most of their time simply aggregating and cleaning data that can then easily be processed on a single machine. ie: they spend 10% of their time doing statistics.

Maybe I'm wrong, but the above is how all the data scientists that I know describe their work.

In my experience it's a bit of both.

Most of a data scientist's time is definitely spent getting data from various places, possibly enriching it, cleaning it, and converting it into a dataframe that can then be modeled (and surprisingly a lot of those tasks are often beyond pure actuaries / statisticians who often get nervous writing SQL queries with joins nevermind Spark jobs)...

But also once you've got that dataframe, it's often too big to be processed on a single machine in the office so you need to spin up a VM and move the data around which also requires a lot more computer science / hacking skills than you typically learn in a stats university degree so I think the term data scientist for someone who can do the computer sciency stuff in addition to the stats has its place...

Data science, unfortunately, is often considered a buzzword because it varies so significantly from org to org, and there are often a lot of No True Scotsman arguments resulting from that, which makes everyone unhappy. But at minimum, every data scientist needs to know SQL; there's no way around it.

I wrote a full blog post ranting about the disparity between expectation vs. reality for data scientists a year ago: https://minimaxir.com/2018/10/data-science-protips/

> But at minimum, every data scientist needs to know SQL; there's no way around it.

Is it no longer the case that every programmer has to know SQL? I have always and still do consider knowledge of SQL a minimum requisite for any programming job...

It is not the case that every programmer has to know SQL (if it ever was).

Programming is now such a big field that plenty of people will go their whole careers without knowledge that's foundational for other programmers - whether it be SQL, linear algebra, memory management, distributed systems, ...

In a way, it's very cool. Instead of a single-dimensional "how good a programmer are you", we've got a multidimensional space. That, naturally, yields a lot more interesting corners, like, say, what could be invented by someone who knows a lot about databases and GPUs.

There are plenty of programming jobs that have no need for SQL, it's just easy to forget how broad the field is. There was never a need for SQL when I was programming microcontrollers, or DSP work in python/matlab

Granted, SQL is an extremely useful skill for programmers, but there are use cases (e.g. front-end dev) where it's not required.

Just because "computing has always centered on data", doesn't mean that every aspect of that was done well (not implying any recent changes brought by "Data Science" here).

For a lot of systems, a lot of engineers/programmers often care very little about the "data", and they very often don't need to. E.g. on a high abstraction level your webserver doesn't care if it's sending HTML, JSON or whatever, and here we are just speaking about formats and not even higher semantics of the data.

I think there can be a lot of value in data-focused roles in companies, because there are many (hard) questions around data that are often badly answered as afterthoughts:

- What are the exact semantics of this data?

- How do we make sure the data is clean?

- How do we evolve the data over time?

A good data scientist is a jack of all trades: mediocre programmer, mediocre ML modeler, mediocre ETL architect, mediocre statistician, and mediocre analyst. You hire this person because you don't need a specialist in each of these fields, but you do need someone who can do all of these things. Also, you don't want to pay a metric ton of money, so you can't expect some superstar.

Good data scientists also need to be great at a couple of things: they must be excellent presenters and excellent diplomats. Getting the data is the hardest part. Data silos are created all over the place, and getting access to that data requires speaking to a lot of higher-ups.

Can confirm. Am data scientist. Am mediocre at everything :)

> Also you don't want to pay a metric ton of money

Doesn't the job title of "data scientist" still correlate with high pay, or has the hype started to wane?

A data scientist doesn't command the same premium as a machine learning engineer or a kubernetes expert, I'm afraid

Depends where. I'm at Amazon, and they are really serious about not letting DS become just business analysts who only do SQL/Python. But elsewhere I do get the impression there is some dilution. Although honestly, that's natural. Just like in the SDE world there are template web devs and hardcore infrastructure guys who get paid massively different amounts but have the same job title, the same thing is happening to DS.

Depends on the location and company. In a smaller city working for a smaller company an entry level data scientist might be happy to pull in 60k. In a major city working for a large company you can at the very least double that number.

If the person can do all of these in a way that makes them a profitable investment, it doesn't really seem so high

These people often have PhD's, I doubt they are mediocre statisticians or mediocre analysts.

Hmmmm. I wish your implication was correct, but I fear it's not.

>Computing has always centered on data

Sure, well-thought out data structures were an integral part of good clean code, but the bulk of data was an artifact of software in the form of logs and largely unused database records, not the "center" of anything. Today, there are many people who take these logs and database records and run very specialized code on them. I entered industry in the mid 2000s and it was very difficult to find jobs that concentrated on this component back then. We're often called data scientists today, and there's orders of magnitude more attention that goes into it. This is a real shift that has brought data into the "center" of many organizations.

I used to work for a big consulting firm. A pretty fancy one, not the fanciest, but fancier than most that claim to be fancy.

10 years ago they had a big initiative around “Analytics.” I’ve programmed and worked in data for a few decades so I tried to figure out why these biz people were interested in Analytics.

When I asked they brought up all the terms you said. I showed examples from various projects how they were just describing basic data programming. But they disagreed because this was new.

I figured it out a few years later when they spent $5B on the initiative and it was making money like gangbusters. This was now a product line. It was new that they were selling it.

So now there’s people who want more money and need to differentiate and thus the titles and whatnot. I wish there was an easier way to view work output of people because now the only way I can differentiate between “data engineers” and “data scientists” and “AI wizards” is by looking at their GitHub profiles or talking to them in depth and profiles are almost never complete or accurate and talking takes a lot of time.

I think the term “data science” has largely been co-opted by data analysts and SQL analysts. At a big software company that you’ve definitely heard of, “data scientist” consultants were people who created daily reports for upper management that they created by clicking through graphical UIs charting usage data (ie we gained a lot of users in this region, because of this customer, or lost some because of a holiday). Data scientist employees were SQL analysts and data engineers.

In my experience most of the people doing what I consider data science have the titles “applied scientist” “machine learning engineer” “data engineer”.

I'd say that data science adds some domain knowledge to computing. Usually, the domain knowledge is in business analytics, but it could also be medicine or social sciences. Where traditionally trained computer scientists fail is not so much at implementing the solution, but being able to gain beneficial insights either from the needs of the clients or from the results of computation.

After a new round of hiring, we came to the conclusion that "pure" computer scientists will not make the cut, even those with a PhD, simply because we are quite convinced that they won't be able to understand the problem at hand.

Data science as a practice is not anything new. There have always been practitioners of statistics, math, prediction, and so on. What is new is the merging of technologies that harness the large amounts/kinds/disparities of data being stored and process it into useful information to drive decisions for an organization. Thus a term was coined: "data science".

The term was coined nearly 20 years ago by some accounts, but is probably older.

This is a good summary of its meaning and history: https://en.wikibooks.org/wiki/Data_Science:_An_Introduction/...

Isn't this sort of marketing of a fancier job title not a theme in the entire industry?

Programmer -> Developer -> Software Engineer

Sysadmin -> Operations Engineer -> Site Reliability Engineer

QA Engineer -> Automation Engineer -> SDET

Sounds pretty dismissive. AI and Big Data are buzzwords that shouldn't be used very often. Data Science is about creating solutions that are right for the size of your data, and the questions being asked of it.

A C++ program with a couple megs on the heap is appropriate, just as an ETL pipeline is appropriate for gigabytes to terabytes.

The joke is that a data scientist is a statistician who works in San Francisco.

RDBMS can be replaced by Impala, Spark SQL, Drill, Presto, or any SQL engine on top of Hadoop.

Hey you've scraped my email from GitHub and sent me this link in some kind of newsletter

I don't think this is GDPR compliant.

will we get feedback?

The damage SQL has done to the general public's perception of the relational model cannot be undone. The relational model is beautiful; it's in no way more complex than objects and attributes, but SQL makes it seem so by conflating orthogonal aspects into it.

Care to elaborate? What you've said resonates, but I lack non-sql relational understanding with which to really evaluate.

Relational algebra never evolved into a relational programming language. SQL is merely a query language: the recursion is horribly done and there is no type system. Imagine this: Friends(a: Person, b: Person) defines the relationship and the foreign keys at the same time, which also makes reasoning easier. mary.Friends.Friends gets all friends of friends of mary, who is a Person. SQL requires you to write a lot of joins to achieve this; it's an error-prone coding experience. That is why there are ORMs, which end up as half-baked solutions.
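For concreteness, here's roughly what `mary.Friends.Friends` costs you in SQL: two joins on the relationship table plus joins back to the person table. A runnable sketch via sqlite3 with an invented schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friends (a INTEGER, b INTEGER);  -- "a is friends with b"
""")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "mary"), (2, "bob"), (3, "carol")])
conn.executemany("INSERT INTO friends VALUES (?, ?)",
                 [(1, 2), (2, 3)])  # mary -> bob -> carol

# mary.Friends.Friends: every hop is an explicit join.
rows = conn.execute("""
    SELECT DISTINCT p2.name
    FROM person mary
    JOIN friends f1 ON f1.a = mary.id
    JOIN friends f2 ON f2.a = f1.b
    JOIN person p2  ON p2.id = f2.b
    WHERE mary.name = 'mary'
""").fetchall()
print(rows)  # [('carol',)]
```

Each extra hop means another pair of joins, which is exactly the verbosity the parent comment is complaining about.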

In fact, logic programming languages and SQL should consolidate into a relational programming language. Everything we write as a program would automatically support persistent and distributed storage. It could also support probabilistic computation to bring machine learning in. Then we would have a complete data-driven software solution.

Unfortunately, right now, we cook everything up with SQL, python, operational DBs, analytics DBs, Spark, Tensorflow.

It could have been a better place.

In the Hadoop world things have evolved to support SQL. Spark, Hive, Impala all have full support for SQL. The Spark implementation is actually faster than you doing low level RDD processing as there are optimizers that work very well. In addition, you can create UDF that are easy to integrate.

The only reason you might choose other approaches is to make the problem look significantly more complicated, thereby justifying more maintenance and resources.

SQL is just too easy, and some super smart engineers don't like it because of it. That said, NoSQL does have value in some corner cases where it can perform better, when most of the logic is simple lookups.

“Why didn’t we use an RDBMS in the first place? “

Because initial application specifications are sparse and definitely wrong. If your application is still up and running 5 years later and your data definition hasn't changed much in the past 3, then maybe refactor around an RDBMS. Designing around rigid structures during your first pass is costly. This is why there's been a rise in NoSQL and dynamic languages.

It seems the other way around to me. RDBMSes have well-defined ways of managing change in schema. Through a combination of modifying the schema itself and judicious use of views and stored procedures, it's often relatively easy to evolve the data model in ways that produce minimal disruption for the application's consumers.

Contrast with, I was at a Cassandra workshop a few weeks ago, and the speaker, when asked directly, conceded that, in Cassandra, you really do need to nail your schema on the first try, because there are no great retroactive schema migration mechanisms, and any evolution is going to result in all consuming applications needing to know about all possible versions of the data model. Which ends up being a huge source of technical debt. And heaven help you if you didn't get the indexing strategy right on the first try.

I think that this might point to the classic tension between easy and simple: RDBMSes are focused (admittedly to varying degrees of success) on trying to keep things simple, but there might be some work involved. NoSQL solutions are often sold as being easy to work with, and I don't deny that at all, but my experience is that, in the long run, they can become a huge source of complexity.

This isn't a tension that's unique to software. In my contractor days, we'd also do things one way when it was just a quick-and-dirty job, and a whole different way if we were looking to build something to last. e.g, you'll never catch me using a sprayer to paint my own house, no matter how much faster it is.

I've been trying to deal with migrating data in a Firebase database, and it's come down to exporting the data to JSON and searching through it with ripgrep and jq. Not ideal! And I've had to deal with entries missing "required" props, entries that have keys with typos, etc. Luckily our dataset is still small enough that this is practical, but none of this work would be required if our product had been built on a SQL database in the first place (I wasn't around when that choice was originally made).

Completely disagree with this. SQL is more flexible than NoSQL. NoSQL databases generally have poor query support, which means if you want to do complex queries you need to precompute the results. This requires you to know your requirements up front. If you're using SQL, then you can do these queries on the fly.

Sure, SQL databases make you define a schema. But it's very easy to change this when requirements change.

It's really not. That's a myth and you get all of the benefits that you want just by designing with a dynamic language.

It's less costly to create a relation table when you realize there may be multiple instances of a piece of data associated with a record than it is to just stick those pieces in an array.

Because when you don't use the relational data, you get the extra work of modifying all of your existing NoSQL records to use the array structure. And as a bonus, you make it easy to do queries in both directions with the relational data.

NoSQL offers virtually no efficiency benefit unless you're actually consuming unstructured and variable data.

With NoSQL solutions you're typically pushing data migrations to code. Yes this is technical debt. But not having to deal with SQL based data migrations is pretty big time saver early on.

Not really. Writing a SQL migration takes what, 10 minutes max? Or you add the columns as you go, and it all just merges into the normal dev time of the feature anyway.

You'll easily make up this lost time just in not having to immediately clean up crappy data that you've written to the database while you're developing the feature. That's been my experience anyway.

I mean, we're all programmers here, and we've all dealt with the growing pains of changing requirements.

But has an RDBMS ever been a major source of that pain? I can't say I've ever encountered a time when it has. If you need to change the structure, just write a script.

I'm not saying it's completely painless, but nothing is.

Before I started using migration scripts (and before there were off-the-shelf libraries for this), it was a bit of a pain. But these days, nope. MySQL is slightly more annoying than Postgres because it doesn't let you wrap DDL queries in transactions, so you can get left in an inconsistent state if your migration has a bug. Postgres is seamless.
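The transactional-DDL point can be demonstrated even in SQLite, which (like Postgres, and unlike MySQL) rolls schema changes back with the transaction. A sketch of a migration whose second statement fails; the table names are invented:

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so we
# control the transaction boundaries explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# A migration with a deliberate bug in its second statement. Because the
# DDL runs inside one transaction, the first ALTER is rolled back too,
# leaving the schema in its original, consistent state.
try:
    conn.execute("BEGIN")
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
    conn.execute("ALTER TABLE no_such_table ADD COLUMN x TEXT")  # boom
    conn.execute("COMMIT")
except sqlite3.OperationalError:
    conn.execute("ROLLBACK")

cols = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
print(cols)  # ['id', 'name'] -- the half-applied migration left no trace
```

On MySQL the first ALTER would have been implicitly committed before the failure, which is exactly the inconsistent-state problem described above.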

IMO if your data model's still so nebulous that you're changing it so often that SQL migrations are a serious impediment to progress, you probably don't need any real datastore yet. You can usually figure out WTF you're going to do, broadly speaking, before persisting anything to a remote database. And if you can, you very much should.

Yes, yes, there are sometimes exceptions, one must repeat explicitly despite having already said it (see: "probably") because this is HN.

Can someone explain to me any circumstances where having no well defined data model is better than coming up with a clear relational model? Honest question. It seems like it would just be a huge hassle trying to deal with your dissimilar data.

When I'm designing an app from scratch I often think about the SQL tables first and how I'm going to build them, and it really sharpens my idea of what my program will be.

I don't see how skipping that process would make things easier.

I've built IoT platforms where data from any device must be accepted. This is largely where my preference for NoSQL comes from. A device created tomorrow will not have a schema I can predict or control. NoSQL allows the easy integration of that device while a traditional database will at worst require a migration for each new device you want to support.

Please correct me if I'm wrong, but that sounds like a very narrow use case, and also something that could be solved by simply stuffing JSON into a RDBMS.

However, perhaps there are tools that NoSQL provides that are handy.

What's wrong is that you asked for any usecase and then critique one because it's not broad enough for you.

Sorry, I'm not trying to be argumentative, I just argue in order to understand better.

Then argue honestly and work to steelman others' arguments. To address your point, I'd hardly consider IoT platforms to be a "narrow" use case. Smaller than the whole of computing, surely, but it's a growing field. The reality is that more and more devices will become available that generate all sorts of hard-to-predict data. Being able to handle those easily will be a large strength for platforms going forward. Dropping this hard-to-predict data as JSON into an RDBMS will certainly come back to haunt you in 5 years.

How will it come back to haunt? Again, just curious.

You'll have dumped it into a strict database, giving yourself a false sense of order and organization. But later, when you need to query that amorphous data, you might be able to use OPENJSON or something else, whereas a NoSQL solution will have been built to handle this type of query specifically, with utilities like key-exists checks and better handling for keys that are missing or only sometimes present.
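That said, the relational side has narrowed this gap: Postgres has jsonb with key-existence operators, and even SQLite ships JSON functions. A sketch of key-exists querying over amorphous JSON rows (sqlite3 with invented payloads; the Postgres spelling would use `->>` and `jsonb_exists` instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO readings (payload) VALUES (?)", [
    ('{"device": "thermo-1", "temp": 21.5}',),
    ('{"device": "cam-7", "fps": 30}',),        # no "temp" key at all
    ('{"device": "thermo-2", "temp": null}',),  # key present, value null
])

# Key-existence and extraction on amorphous JSON, in plain SQL.
# json_type returns SQL NULL for a missing path and 'null' for JSON null,
# so the two failure modes can be told apart.
rows = conn.execute("""
    SELECT json_extract(payload, '$.device')
    FROM readings
    WHERE json_type(payload, '$.temp') IS NOT NULL   -- key exists...
      AND json_type(payload, '$.temp') != 'null'     -- ...and isn't JSON null
""").fetchall()
print(rows)  # [('thermo-1',)]
```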

You can't really design your tables well enough without knowing your UI; it's a back-and-forth process.

And it's something you can do entirely on paper, too, before you go to code.

However, I've never greenfielded an actually large application.

How does UI inform table design?

It will give you lots of insight when planning your data models/tables.

I've found that MongoDB is really the Visual Basic of databases: It lets you rapidly get something running where your persistent data looks very similar to your data structures.

But, more importantly, this quote really is right:

> The more I work with existing NoSQL deployments however, the more I believe that their schemaless nature has become an excuse for sloppiness and unwillingness to dwell on a project’s data model beforehand.

Far too many software engineers just don't understand databases or the long-term implications of their decisions. Good SQL is rather easy, but it does take a few more minutes than the sloppy approaches.

Or, to put it differently: some NoSQL databases are great at rapid prototyping, but they can't scale and hold little value for someone willing to put in the extra hour or two upfront to use SQL.

If you embrace both structure and a defined path to change that structure, you get the best of all worlds. Database migrations and schema changes should be the norm, not a feared last resort.

Otherwise you end up with equivalent logic littered throughout multiple application codebases sharing the same database.
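A minimal sketch of what "migrations as the norm" can look like: an ordered, append-only list of schema changes, each applied exactly once and recorded in a version table. This is a toy runner (the table and column names are invented), but it's the core of what tools like Rails migrations or Flyway automate:

```python
import sqlite3

# Append-only list of schema changes; never edit past entries,
# only add new ones at the end.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (v INTEGER)")
    current = conn.execute("SELECT MAX(v) FROM schema_version").fetchone()[0] or 0
    # Apply only the migrations that haven't run yet.
    for v, stmt in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(stmt)
        conn.execute("INSERT INTO schema_version VALUES (?)", (v,))

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # safe to re-run: already-applied steps are skipped
cols = [c[1] for c in conn.execute("PRAGMA table_info(users)")]
```

Because every codebase sharing the database runs the same ordered steps, the schema evolves in one place instead of being re-derived in each application.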

I feel like this is where ActiveRecord shines: the initial development is thought of as objects and their relations to each other, it's very easy to alter data definitions, and in the end you get a reasonably normalized database underneath everything. The reality is that most of your models aren't going to undergo wild schema changes, and if they are, or if you need traversable unstructured data, you can fall back to JSON columns for those scenarios.

The lack of (or too-weak) safety and consistency guarantees will waste more time than it saved well before five years are up; more like within three to six months of going to production, if not before you hit production at all. Even more true if you're dynamic the whole way up (JS-to-JSON-to-[Ruby/Python/JS]-to-NoSQL and back again).

With the usual caveat that yes, some datasets & workloads probably do benefit from certain NoSQLs, but that's a pretty small minority.
