
Modern Data Practice and the SQL Tradition - MarkusWinand
https://tselai.com/modern-data-practice-and-the-sql-tradition.html
======
danso
I've personally grown to love SQL and I think it is by far the clearest (if
verbose) way to describe data transformation work. I learned traditional
programming (e.g. languages like Python) long before I stumbled on SQL, but
I'm not sure I could've understood Pandas/R as well without having learned
SQL, particularly the concepts of joins.

That said, my affection for SQL correlates with my overall growth as a
software engineer, particularly grokking the concept of piping data between
scripts: e.g. writing a SQL script to extract, filter, and transform data, and
then a separate Python/R script that creates a visualization. I think SQL
seems so unattractive to data scientists (i.e. people who may not be
particularly focused on software development) because it seems so alien and
separate from the frameworks (R/Pandas) they use to do the actual analysis and
visualization. And so it's understandable to just learn how to do it all in
R/Pandas, rather than a separate language and paradigm for data wrangling.
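As a toy illustration of that split (made-up table and numbers, using Python's built-in sqlite3 in place of a real warehouse): the SQL does the extraction, filtering, and aggregation, and the downstream script only consumes the already-shaped rows.

```python
import sqlite3

# Stand-in for a real database; table and values are invented for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 120.0), ('north', 80.0), ('south', 40.0);
""")

# All the wrangling lives in SQL: filter, aggregate, sort.
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 50
    ORDER BY total DESC
""").fetchall()

# A separate Python/R step would chart `rows`; here we just print them.
for region, total in rows:
    print(region, total)
```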

~~~
revscat
> I've personally grown to love SQL and I think it is by far the clearest (if
> verbose) way to describe data transformation work.

I'm going to go against the grain here, at least as expressed in this thread,
and say something that I think is defensible, but...

SQL is a terrible language.

1\. It's old. There have been many, many advancements in programming languages
since its inception in the mid-80s, none of which have made it into the
language itself.

2\. SQL is naturally tedious, with few mechanisms for isolating common
functionality into e.g. objects/interfaces/functions/monads. Stored procedures
exist, but still suffer from the other shortcomings of the language.

3\. There's no vendor-agnostic way to write unit tests.

4\. The code itself is ugly. There isn't a standardized (or even semi-
standardized) way to confidently format code, compared to languages like
Python or Java. I've seen (and very occasionally written) code in languages
like Ruby or Java that is elegant to the point of being artistic. SQL is
almost universally ugly.

5\. Over time SQL tends towards becoming unmaintainably complex, for reasons
I've come to believe are endemic to the language itself. In my experience and
almost without exception, SQL has been the source of the most complex and
difficult to maintain/test code in every codebase I've worked with. Views that
are 300+ lines long, with multiple layers of inner joins, are common. And no
one wants to touch these monsters because it's almost inevitable that
refactoring will break something way over yonder.

6\. There's no type checking.

7\. There's no standardized library sharing framework like RubyGems or
Python's pip.

8\. IDE support is limited, and is mainly limited to basic code formatting.
Compare this to something like IntelliJ where you can just ctrl-click on a
method or variable and go to the definition. Even TypeScript is better here.

9\. Refactoring is dangerous, and the ability for your tools to lend a hand is
limited. Want to rename a public method in Java? Right click. You can be
pretty confident doing so won't break your app (with caveats). If you want to
rename a column in SQL, you're not going to have nearly that amount of
confidence.

Now, I think there is a place for something _like_ SQL: fast, able to easily
access and update large amounts of data, etc. Something that executes directly
against an RDBMS.

I just wish it wasn't _SQL_.

~~~
tannhaeuser
I hate point-by-point rebuttals, but here goes mine anyway ;)

> _1\. It 's old. There have been many, many advancements in programming
> languages since its inception in the mid-80s, none of which have made it
> into the language itself._

It's even older than that (around 1979), but being old isn't an argument in
itself.

> _2\. objects/interfaces/functions/monads, Stored procedures_

Not the job of a database.

> _3\. There 's no vendor-agnostic way to write unit tests._

As compared to alternative NoSQL database code that isn't portable at all?
Agree though that the testing culture in SQL land could be improved.

> _4\. The code itself is ugly._

Entirely a subjective matter. Programming != poetry.

> _5\. Over time SQL tends towards becoming unmaintainably complex_

If queries are complex in a language that is already very compact compared to
equivalent procedural code, then chances are they're complex because the
business problem is irreducibly complex, and another language won't save you
here.

> _6\. There 's no type checking._

Of course there is! You can't insert character data into number columns, for
example (SQLite handles this a bit differently, though), and you'll get a
proper type-related error message on e.g. date functions.
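The SQLite caveat is visible in a few lines (Python's built-in sqlite3, toy table): under SQLite's type-affinity rules a mismatched insert quietly succeeds, where Postgres/Oracle/etc. would raise a type error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")

# SQLite stores the string as-is because it can't be coerced to a number;
# most other engines would reject this INSERT with a type error.
conn.execute("INSERT INTO t VALUES ('abc')")
(stored_type,) = conn.execute("SELECT typeof(n) FROM t").fetchone()
print(stored_type)  # 'text', despite the declared INTEGER column
```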

> _7\. There 's no standardized library sharing framework like RubyGems or
> Python's pip._

What exactly would a stdlib (of table definitions?) be useful for? The
standard is the SQL language itself. Maybe the state of portable SQL scripting
could be improved by lifting the syntax of e.g. Oracle SQL*Plus (also
implemented by DB2 for a couple of years now), or another syntax, into the ISO
SQL standard.

> _8\. IDE support is limited, and is mainly limited to basic code
> formatting._

Last I checked, IDE support for SQL was quite good on Eclipse, including
autocompletion. When you assemble SQL from strings in your app, there's a
limit to what an IDE can do.

> _9\. Refactoring is dangerous_

SQL/relational algebra has a very powerful construct known as "views" - a
consistent relational interface for your apps to work against. Refactoring is
as good or bad as you make it to be.
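A small sketch of that idea (Python's built-in sqlite3, invented table and column names): the base table gets refactored, the view definition absorbs the change, and queries written against the view keep working.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_raw (id INTEGER, fullname TEXT);
    INSERT INTO users_raw VALUES (1, 'Ada');
    -- The view is the stable interface applications code against.
    CREATE VIEW users AS SELECT id, fullname AS name FROM users_raw;
""")

# Refactor the base table; only the view definition has to follow.
conn.executescript("""
    DROP VIEW users;
    ALTER TABLE users_raw RENAME COLUMN fullname TO display_name;
    CREATE VIEW users AS SELECT id, display_name AS name FROM users_raw;
""")

# Application queries are untouched by the rename.
row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()
print(row)  # ('Ada',)
```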

~~~
matwood
I think SQL works well for its problem domain and agree with everything you
said.

> It's even older than that (around 1979), but being old isn't an argument in
> itself.

I wanted to add an additional comment on this point. What is it with old
automatically equalling bad? I definitely appreciate where improvements can be
made, but I think as an industry we incorrectly think there is something wrong
with an old/mature technology. Throwing things away just because they are old
hinders progress since we're constantly rewriting tested, working code. I'll
get off my soapbox now.

~~~
revscat
> What is it with old automatically equalling bad?

That's not what was said. What was said was that it's old and that it hasn't
benefited from decades of research on languages and on the ways programmers
are most productive.

------
commandlinefan
Not to step on anybody’s toes, but… I often suspect that there are lots of
people who tried, and failed, to learn relational database techniques and
gravitate toward schemaless solutions like Mongo just because they’re easier
to understand. Maybe not everybody, but more than a few that I’ve interacted
with.

~~~
michelpp
They're superficially easy to understand, but end up moving the complications
of concurrency, transactions, and statistical query planning up into the
application. Those are much harder problems to solve correctly than just
learning SQL and understanding the output of EXPLAIN.
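For the EXPLAIN point, even SQLite's minimal variant shows the planner's decision (toy schema below; Postgres's EXPLAIN/EXPLAIN ANALYZE gives far more detail):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE INDEX idx_orders_customer ON orders (customer_id);
""")

# The plan reports a SEARCH via idx_orders_customer rather than a full-table
# SCAN, which is exactly the kind of thing to check before shipping a query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7"
).fetchall()
for row in plan:
    print(row)
```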

~~~
HelloNurse
It's a common ignorance pattern: hipsters reject complex and mature effective
tools that require a learning investment (in this case a RDBMS and its fancy
configuration options and SQL capabilities) because they don't know what they
can do, and embrace inadequate familiar and/or shallow tools (in this case,
misused fashionable and "schemaless" NoSQL databases) because they are
confident that they can fill the gaps (in this case application code in
familiar programming languages, to the extreme of "reinventing the wheel" with
completely custom frameworks).

~~~
commandlinefan
> hipsters reject complex and mature effective tools that require a learning
> investment

I think there is an element of that, but let me suggest another possibility:
this is driven by modern (so-called “agile”) project management techniques.
Every week or so, the programmers are called on the carpet to provide
“estimates” for a series of a dozen or so vaguely-worded “tasks”, each of
which does have clear business value but is pretty open-ended about acceptance
criteria. These estimates can only be on the order of a few hours: if the
estimate goes higher, the project manager insists that you “break it down”
until the individual subtasks can be estimated on the order of a few hours.
This “breaking down” process, likewise, is expected to only take an hour or so
- if it takes longer, you’re going to be answering for it next review period.

So you have a “task” that looks like it could use a relational database, but
you don’t really know relational databases that well. You’re not an arrogant
narcissist, so you recognize that something that takes other people weeks or
months to learn effectively is probably going to take you weeks or months to
learn effectively, as well. You _do_, however, know how to stuff documents
into a document repo, and you can estimate that, and the estimate is within
the acceptable four-hour maximum that the project management group that is,
for some reason, now running the show, permits.

Even if you’re supremely dedicated and spend your off time studying (assuming
you don’t have a family and _have_ any off time), you still have to come up
with something RIGHT NOW so maybe after a few weeks of evening study you might
have been able to produce something better, but the idiotic “sprint” ends on
Friday, so you stuff it into a key-value store, mark the task complete, and
spend orders of magnitude more time working around the problems that were
introduced by myopic, shallow-thinking project managers than you would have
spent actually putting together a decent solution.

So there’s that.

~~~
catalogia
From what I've seen and personally experienced, the sort of inexperience-
masking hipsterism he's describing starts earlier: in university or high
school, years before Agile or other management styles become relevant to the
individual.

------
0xFACEFEED
The author of this essay is beating a dead horse. "NoSQL" (I really dislike
the term) has proven its value over the years and deserves a seat at the table.
People will always misuse technology or implement it poorly, but I don't think
it warrants yet another oversimplified "SQL vs NoSQL" rant.

~~~
ramraj07
Genuinely curious since I have only seen bad things said about NoSQL in most
places over the years, what benefit does it provide other than scaling?

~~~
dodobirdlord
It's unlikely that a company needs to invest in data scientists or even think
very hard about their data organization until the scale of their data
is already pushing the bounds of what most RDBMSs can handle. NoSQL is nice if
you are planning for scale, since it will seamlessly get big without much
thought or any significant changes to the performance. There are no "gotchas"
that will cause very long-running queries or that will lock a huge number of
rows, so performance is very good and most importantly very stable. This is
largely due to the fact that you have to think more deeply about your data
access plan up front, since almost any query is a table scan (which, ideally,
never happens). I think that this is secretly a benefit in that it forces
people not to perform ad-hoc queries on databases, and to think of their
databases in terms of the APIs that they have built over their databases,
because the databases are not going to efficiently support any other sort of
access than the access plans included to support those APIs. I would lean in
favor of a NoSQL database to back a production networked service because the
upsides are helpful in this case (stable performance, easy to scale) and the
downsides are not significant (have to plan your APIs up front -> going to do
that already, no ad-hoc queries -> not going to run ad-hoc queries on a
production database anyway).

~~~
steve_gh
> It's unlikely that a company needs to invest in data scientists or even
> think very hard about their data organization until the scale of their
> data is already pushing the bounds of what most RDBMSs can handle.

I work in an engineering data science consulting group for a major UK PLC.
Almost all of our clients have mainly SQL based data sources, and none of them
really require cloud scale. In fact I don't think I have come across a client
whose data I can't handle in SQLite on my laptop.

However, the problems we solve do require data scientists - mainly STEM MScs
with a smattering of PhDs who are capable of combining statistics, domain
knowledge and the ability to work within a variety of organisations and
present.

~~~
ellimilial
Seconding that. Several ‘AI’ startups I worked with do not require a massive
amount of data to create MVPs/first releases.

They do, however, invest early in sufficient data science functions to have a
good liaison between the researchers and data acquisition.

------
gfo
I've been working with SQL a lot in my job lately. For what it's worth, I'm a
big fan.

It seems the central argument here is NoSQL doesn't force you into good design
habits so it's overrated. I'll concede this is partially true from my
perspective because much of my work involves trying to sanitize, transform,
and otherwise refactor poorly structured or designed NoSQL datasets.

But I've also seen my fair share of SQL databases that are poorly designed or
don't use features that are meant to benefit developers (I haven't seen a
Foreign Key smartly implemented in a LONG time). It's not really fair to say
NoSQL has encouraged poor design practices; from my experience, it seems like
data model implementation is given little effort in general.

NoSQL takes the 'training wheels' off data model implementation, where SQL is
like keeping them on. But even with them you can still fall off the bike if
you aren't careful, though it's much harder.

~~~
lgl
>I haven't seen a Foreign Key smartly implemented in a LONG time

This has mostly been my experience as a web developer, too. I rarely see
applications that make use of the foreign key constraints supplied by the
database server. Usually I see relations being handled by code on the
application side.

Even when I build apps that use SQL, I always implement this stuff in code
instead of FKs, even though I know how they work, and even though when
designing a database schema in an app like (for example) MySQL Workbench I
have the option to add them easily. I always use the "ignore foreign keys"
option and then implement the constraints in code. I just find it a bit more
sane to have all of the logic inside the app.

I know I'm probably "doing it wrong" but would love to hear some other
opinions about this from the hn crowd that does web development. I'm guessing
that more "enterprisey" apps like CRM's and ERP's will probably use more of
the native database stuff.

~~~
mpartel
Application-level checks are prone to race conditions, especially at the
default transaction isolation levels of many popular SQL databases.

For example: transaction X deletes a parent row and its child rows while a
concurrent transaction Y inserts a new child node. Without a FK, this can
leave an orphaned child row.

For similar reasons, SELECT before INSERT is generally not a safe replacement
for a UNIQUE constraint.

You don't even need that high a user volume to see these errors in practice.

It's a shame that ORMs don't tend to encourage the use of database-level
constraints, but even if I just want to build something quickly in e.g. Rails,
I still add the constraints as an extra check because I'd rather get the
occasional 500 than an inconsistent database.
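A minimal sketch of the UNIQUE case (Python's built-in sqlite3, one connection, so the race itself isn't reproduced; the point is that the constraint, not a prior SELECT, is what the database actually enforces):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT UNIQUE)")
conn.execute("INSERT INTO users VALUES ('a@example.com')")

# Naive check-then-insert: between this SELECT and the INSERT below,
# another connection could insert the same email, so the check proves nothing.
exists = conn.execute(
    "SELECT 1 FROM users WHERE email = ?", ("a@example.com",)
).fetchone()

# The constraint is the check that still holds under concurrency.
try:
    conn.execute("INSERT INTO users VALUES (?)", ("a@example.com",))
    outcome = "inserted"
except sqlite3.IntegrityError:
    outcome = "rejected by constraint"
print(outcome)  # rejected by constraint
```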

------
vkazanov
It's nice that the article mentions Codd and his relational model of data, but
what it doesn't mention _is how badly SQL parrots relational algebra_. The
language is inspired by the idea ("based on a real story"(c)), yes, but it
takes a really clean and sound model and makes an unbelievable mess out of it.

SQL is just an ugly historical accident. Unfortunately, this is how it often
works...

NoSQL databases are a different story, of course. BTW, I believe they predate
Codd's work. There were many examples of non-relational DBs in the 70s.

~~~
commandlinefan
> how badly SQL parrots relational algebra

I’m continually amazed that we’re STILL primarily interacting with databases,
programmatically, by passing SQL statements as strings into them, rather than,
say, passing in a list of column names as an array, and a list of conditions
as tuples, etc. ORM’s try to bridge this gap, but they do a pretty bad job of
it; they end up oversimplifying so much that the result is unreasonably
inefficient.

~~~
okennedy
The difficulty with the approach you propose, and by contrast one of the major
strengths of SQL, is composition. Passing a list of column names and a list of
conditions lets you express precisely two concepts: Filtering and Projection.
You could also add more: A set of tables to join together, an "aggregation"
version of the same operation, etc... Going down this path, however, leads to
a monolithic function that becomes progressively harder to generalize.

What relational algebra (and by extension SQL) gets "right" is that each of
these operations (Projection, Filtering, Join, Aggregation, Union) are
composable: They take 1 or 2 collections as input and produce a collection as
output. Moreover, each operation has simple and well-defined semantics from
which you can build much more complex logic.

That's not to say that Relational Algebra can't be built in to an imperative
language. Scala (and by extension Spark) collections are a great example of
composable operators at work. Ruby's array methods, Python comprehension
syntax, and Pandas/NumPy are similar examples of simple, composable primitives
that combine to be much more powerful data transformations.
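For instance, a join-filter-project pipeline written with plain Python comprehensions (invented rows): each step consumes a collection and produces a collection, which is exactly what lets the operators compose.

```python
# Tiny in-memory "tables"; all names and values are made up.
employees = [
    {"name": "ann", "dept_id": 1, "salary": 100},
    {"name": "bob", "dept_id": 1, "salary": 80},
    {"name": "cat", "dept_id": 2, "salary": 90},
]
departments = [{"id": 1, "dept": "eng"}, {"id": 2, "dept": "ops"}]

# Join, then select (filter), then project; same shape in, same shape out.
joined = [
    {**e, "dept": d["dept"]}
    for e in employees
    for d in departments
    if e["dept_id"] == d["id"]
]
well_paid = [r for r in joined if r["salary"] >= 90]
names = [r["name"] for r in well_paid]
print(names)  # ['ann', 'cat']
```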

Apart from RA-based language primitives, there's also compiler support that
allows you to use SQL directly, but avoid passing strings around at runtime.
.NET's LINQ is a great example. I'll also pitch one of my own projects,
DBToaster ([https://dbtoaster.github.io/](https://dbtoaster.github.io/)),
which compiles view maintenance queries down to a C++ or Scala class.

In short, I agree that passing strings around leaves performance on the floor
and leads to escaping and code injection nightmares. But SQL is the
culmination of literally decades of use-based design, and any effort to
replace it needs to take care to understand what it does well _and why_ (like
the efforts I reference above).

------
exabrial
The people that bemoan SQL are like the people that bemoan staff notation.
Every [music|data] student thinks they can do better when they first learn it
instead of embracing the thought that's gone into it over many years.

~~~
WhompingWindows
Interesting analogy; the problem is staff notation has been used by basically
every major composer for hundreds of years. SQL, while important, is far
less universal and far more complex. In the end, staff notation is more like
writing than it is like SQL. Writing and staff notation both have a limited
number of characters with an insane number of possible combinations. With SQL,
the main challenge lies in understanding the underlying data structures, not
the declarative symbols of SQL itself.

------
cube2222
I don't agree with the vilification of NoSQL, but I do agree that SQL is a
great query language.

That's partly why I wanted to create a tool to query various databases (NoSQL
ones or files too) with SQL. We're still in an early stage with OctoSQL
regarding the variety of databases, but take a look if that sounds appealing
to you:
[https://github.com/cube2222/octosql/](https://github.com/cube2222/octosql/)

~~~
chrisjc
What sets OctoSQL apart from the existing options such as Apache Drill (even
Spark SQL for that matter) or future projects such as PartiQL?

[https://partiql.org/](https://partiql.org/)
[https://drill.apache.org/](https://drill.apache.org/)

~~~
cube2222
We're aiming to have very ergonomic stateful stream processing with only SQL
and we're working on it currently. That's basically what's meant to set us
apart.

~~~
chrisjc
So tapping into the change-logs of the underlying data-sources and providing a
stream processing layer that's expressible in a stream SQL dialect?

btw, not being critical of your project, just trying to understand it.

~~~
cube2222
Mainly thinking of explicitly stream-oriented data sources like Kafka, but
yeah, change logs can be handled with an analogous abstraction.

There's a great paper on that: "One SQL to Rule Them All", check it out.

We also want to scale well from single-computer one-off data exploration
queries to full-blown clustered long-term stateful stream processing.

The point is to provide a well thought out SQL interface to as many data
sources as possible, and like drill does, push down as much computation as
possible.

We actually learned about drill only after creating OctoSQL, but that's
another story. (We're definitely less mature currently and support fewer
datasources)

~~~
chrisjc
Ha! I was going to link that to you once I understood what you meant.

[https://news.ycombinator.com/item?id=20059006#20062821](https://news.ycombinator.com/item?id=20059006#20062821)

I believe there is a similar effort going on in the Flink project. They're
slowly merging in work from the Blink project to provide a unified SQL
paradigm (batch + stream).

~~~
cube2222
Yup, Akidau's book is a great resource too.

We're basically aiming for a middle ground between Flink, Presto and Drill.

------
asdfman123
Coming from the boring enterprise world, where we didn't really get swept
along in the NoSQL hype, this doesn't seem like it should remotely be a
surprise to anyone.

"Hey, turns out there's lots of great ways to use SQL!" Yeah, we know, they've
been at the core of our business for the last 20 years at least.

Lots of enterprise shops started out with a SQL database keeping track of all
the data, and web apps grew up organically around it.

------
cty
SQL is a functional programming language. No other imperative programming
model will be cleaner or clearer in expressing intent. But, you have to
understand SQL to begin with.

~~~
codetrotter
Did you mean to say declarative? That’s what makes it clear in intent isn’t
it? That it’s declarative. Whether or not SQL is also functional is orthogonal
to that isn’t it?

~~~
chasd00
Further, AFAIK there's no assignment or iteration in SQL. It's not a
programming language at all. It's a... query language.

~~~
yellowapple
I'd consider INSERTs and UPDATEs to technically be "assignment", even if
they're very different from how other languages do it.

Some SQL dialects do support both traditional variable assignment and
iteration for those cases where an iterative/imperative approach makes more
sense than trying to shoehorn the problem into something set-based /
declarative. Some limit them to stored procedures (e.g. Postgres, and AFAICT
Db2), while others allow them pretty much anywhere (e.g. SQL Server / T-SQL).

------
zzzeek
Give the data scientists SQL and relational algebra.

But please don't give them stored procedures and triggers.

~~~
will_pseudonym
(genuine question) What are the best alternatives to triggers? And what makes
them a bad idea?

I'm pretty much with you on stored procedures.

~~~
Ididntdothis
I am sure there are good use cases for triggers but I have seen quite a few
databases that used triggers a lot and it almost always felt like the
equivalent of spaghetti code. Things are happening and it takes forever to
figure out why they are happening.

~~~
BareNakedCoder
Depends on your perspective. Yes, if you are an application developer with
weaker database skills trying to follow the data integrity logic in the
database. No, if you are a skilled database person with weaker knowledge of
all the source code of all the applications (could be multiple) sharing the
database. Then things are happening and it takes forever poring through each
app's code to find its data integrity logic to figure out why.

By centralizing data integrity logic in the database, you know where to look
and that all apps using the database will abide by it.

~~~
Ididntdothis
I agree with you but in the cases I saw I felt that the triggers were used to
fix problems in the code more than being part of a consistent data strategy. I
admire well designed databases but unfortunately there aren’t too many of them
out there.

I think part of the problem is that there is still this huge chasm between
good coding skills and good database skills. It’s hard to have both.

------
Dowwie
I've worked extensively with SQL and relational data modeling. It's been very
useful for accomplishing my work. I haven't come across better tools and so I
haven't adopted alternatives. If the time ever comes where there is truly a
superior tool set to the one I am currently using, I will gladly stop using
antiquated technology. Newness isn't sufficient to sway me. No dogma here.
Just pragmatism.

------
acd
Data structures still matter, and very much so! If you run in the cloud, data
structures have to be very efficient. Data should be normalized. MySQL has
built-in hash table support (adaptive hash indexes).

I think one should also not over complicate the data layer.

Boring tech: I like the safety of the C and D in SQL's ACID.
[http://boringtechnology.club/](http://boringtechnology.club/)

~~~
nomel
> Data should be normalized.

This is a rule of thumb, only, and depends on your definition of efficiency
and your queries. If you normalize everything, especially with some analytics
queries, you will quickly find silly barriers to query times caused by all of
the required joins.
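A toy contrast (Python's built-in sqlite3, invented data): the normalized schema needs a join per dimension, while a flattened copy answers the same question join-free, which is the trade data warehouses make.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the analytic question needs a join per dimension.
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, country TEXT);
    INSERT INTO customers VALUES (1, 'DE'), (2, 'FR');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 2, 7.0), (12, 1, 3.0);
""")

normalized = conn.execute("""
    SELECT c.country, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.country
""").fetchall()

# A denormalized/flattened copy trades storage and update cost for
# join-free reads: the data-warehouse "wide table" pattern.
conn.executescript("""
    CREATE TABLE orders_flat AS
    SELECT o.id, c.country, o.amount
    FROM orders o JOIN customers c ON o.customer_id = c.id;
""")
flat = conn.execute(
    "SELECT country, SUM(amount) FROM orders_flat GROUP BY country"
).fetchall()
print(sorted(normalized) == sorted(flat))  # True: same answer, cheaper reads
```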

~~~
matwood
Sure, but now we're getting into higher level system design. The schema used
to run the operational system may/will be different from the one used for
reporting and analytics. This is what data warehouses were born out of with
the flattening of data, star/snowflake schemas, dimensions, etc...

------
kureikain
> On the other hand, if you check Postgres’ configuration file, most of the
> parameters are straightforward and tradeoffs not so hard to spot.

The article also refers to MongoDB as hard to configure. It may be a matter
of taste, but I found MongoDB's documentation is way better than Postgres's.
They have things like this:

[https://docs.mongodb.com/manual/administration/production-checklist-operations/](https://docs.mongodb.com/manual/administration/production-checklist-operations/)
[https://docs.mongodb.com/manual/administration/analyzing-mongodb-performance/](https://docs.mongodb.com/manual/administration/analyzing-mongodb-performance/)

Which I can easily follow and apply, and which are actionable: set noatime in
fstab, use XFS, raise the max file handle limit, etc.

For Postgres
([https://www.postgresql.org/docs/12/index.html](https://www.postgresql.org/docs/12/index.html))
I cannot easily find something similar to the MongoDB one.

------
lukev
Good article, but it and a lot of the comments here seem to be conflating SQL
as a language with Postgres (or other RDBMS implementations).

For the record, I like both these things and they often go together well. But
once your data gets too big for Postgres, you don't have to immediately jump
to NoSQL: modern distributed SQL tools like Presto are quite good and can let
you continue to use the same SQL interaction patterns, even at the terabyte or
petabyte scale.

You have to be a _little_ more aware of what's going on under the hood, so you
don't write something pathological, but it's quite a powerful approach.

I am even using Presto for a lot of what would normally be considered ETL:
using SQL to select from a "dirty" table and inserting into a "clean" table on
a schedule.
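A toy version of that ETL pattern (sketched here in SQLite via Python's sqlite3; Presto's dialect differs in details, and the table names are made up): the whole step is one INSERT ... SELECT that filters junk rows and casts types on the way through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events_dirty (user_id TEXT, amount TEXT);
    INSERT INTO events_dirty VALUES ('7', '12.5'), ('', 'oops'), ('9', '3');
    CREATE TABLE events_clean (user_id INTEGER, amount REAL);
""")

# The scheduled "ETL job" is a single statement: drop rows with missing ids
# or non-numeric amounts, casting the survivors to proper types.
conn.execute("""
    INSERT INTO events_clean
    SELECT CAST(user_id AS INTEGER), CAST(amount AS REAL)
    FROM events_dirty
    WHERE user_id <> '' AND amount GLOB '[0-9]*'
""")
rows = conn.execute("SELECT * FROM events_clean ORDER BY user_id").fetchall()
print(rows)  # [(7, 12.5), (9, 3.0)]
```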

~~~
dswalter
And presto's _rich_ functionality for complex data types (maps and arrays) as
well as window functions makes some pretty challenging things possible.

------
barrkel
Start with Postgres. Don't start with SQLite. SQLite is a file format, not a
database; it scales atrociously (I've seen simple queries run for 10s of
seconds with 100MB of data), it basically doesn't work in concurrent update
scenarios, and the more data you put into it, the more pain you'll have
migrating to Postgres.

Use SQLite if you want a file format where you don't want to rewrite the whole
file for every update, or if you're in a situation or environment where it's
not feasible to have an actual database server.

~~~
ellimilial
Sure if your use case is an analysis of interconnected entities and you either
work alone or have someone to help you maintain a shared server.

Otherwise, please don't. I earn my living (amongst other things) due to
organisations that went with this as a blanket approach.

Use the right tool for the job. SQLite is fantastic for small to medium size
datasets, and shines in the immutable case. Plus its maintenance and sharing
cost is close to zero (something you will cherish once Docker comes into play,
or if you want to learn or test some SQL without going through the pain of
setting up local permissions for your schemas).

------
psv1
As someone who has to use Elasticsearch as an _only_ data store, yes.

~~~
licnep
Can you elaborate on which issue(s) you encountered? We are considering
Elasticsearch at the moment for document storage and search, versus postgres.

~~~
qohen
FYI, you might want to check out ZomboDB[0], which integrates Postgres and
ElasticSearch. It's open source[1] and, fwiw, the developer was helpful when I
pinged him with some questions a while back (and is available for consulting
services).

From the project's github page[1]:

 _ZomboDB brings powerful text-search and analytics features to Postgres by
using Elasticsearch as an index type. Its comprehensive query language and SQL
functions enable new and creative ways to query your relational data.

From a technical perspective, ZomboDB is a 100% native Postgres extension that
implements Postgres' Index Access Method API. As a native Postgres index type,
ZomboDB allows you to CREATE INDEX ... USING zombodb on your existing Postgres
tables. At that point, ZomboDB takes over and fully manages the remote
Elasticsearch index and guarantees transactionally-correct text-search query
results._

[0] [https://www.zombodb.com/](https://www.zombodb.com/)

[1] [https://github.com/zombodb/zombodb](https://github.com/zombodb/zombodb)

------
ComodoHacker
One point got me curious:

>As a rule of thumb, vertical scalability is in most cases more economical
than the horizontal one.

Isn't _the opposite_ the reason why we started to scale horizontally in the
first place?

~~~
alexhutcheson
No, it's because projects with huge amounts of data were growing beyond the
limits of what you can reasonably do on one machine. Machines were less
capable then (smaller disks, less memory), and those limits are a _lot_ higher
now. If your data is small enough to fit on one (very beefy) machine, then
it's probably still cheaper to pay for that high-end machine vs. distributing
to a bunch of less capable ones.

There are exceptions - distributing the data can be really helpful if you need
to do a lot of bulk I/O (ETL jobs, analytical queries, etc.), but it comes at
the cost of making "transaction" use-cases difficult and expensive. Using a
scaled-up OLTP[1] database for user interaction and a scaled-out OLAP[2]
database for analytics and ETL jobs is a common pattern.

[1]
[https://en.wikipedia.org/wiki/Online_transaction_processing](https://en.wikipedia.org/wiki/Online_transaction_processing)

[2]
[https://en.wikipedia.org/wiki/Online_analytical_processing](https://en.wikipedia.org/wiki/Online_analytical_processing)

------
maxdemarzi
Querying data in “Cypher” is so much easier. I have 20 years as a database
developer; Cypher is way better. I can live-code 50 complex queries in Cypher
before I get one done in SQL.

~~~
mycall
What is Cypher?

------
laichzeit0
It’s best not to take criticism seriously from people who have no formal CS
education and can’t even define the premises of relational algebra. Ah well.

~~~
WhompingWindows
Do you have any particular critiques of the author's content? Otherwise, this
feels like gate-keeping, since you provided no specific critiques of any
content.

~~~
laichzeit0
Not the author, I’m referring to the people that drive the author to have to
make a post like this at all.

------
jackschultz
Alright here's a relevant question I've been having in terms of this. Let's
say I have the code to gather / scrape / load some stats into postgres, but
then want to run projections on them.

For example, if I'm trying to predict the next day's stats by simply taking
the average of past stats, how many days back should I look when taking that
average? Is the last 3-day average the best? 5 days? 8? 10?

There are clearly a couple ways to do it. One is to get a data frame of
all the stats for all the objects, write the python logic to loop through the
days, get back the stats from the past X days not including the current day,
take the average, and then store that back to postgres in a projections
table. A function like set_projections(5), where 5 is the number of days in the
past I'm taking the average of.

Second way to do this is to write that function as a plpgsql function that uses
subqueries to find the past X days' stats for the players and then creates or
updates the projections table all in sql, so we can run `select
set_projections(5)` and that'll do it itself.

So the question becomes, which one is "best"? I have to imagine it's mostly a
case by case basis. Since it's only me here, I've been doing it in postgres
alone since it can be done in one query (with multiple sub queries, yes), but
that's it. With python, it'd involve many more steps. On the other hand, the
sql looks huge and then I've been running into the issue of do I split some of
the sub queries into sub functions since that's what I'd be doing in python?
If there were more people involved, would it be bad to have large chunks of
logic like that in postgres, since we wouldn't know the others' SQL skill?
Mostly they'd be coders and could write the projections in whatever languages
they wanted.

Another example of this tradeoff is how I should interact with the database. I
have a rails background, and ActiveRecord is an incredibly good ORM, whereas
SqlAlchemy hasn't done it for me. In either case, there's a ton of overhead to
getting the connections running with the correct class variables. So instead,
I kind of created my own ORM / interface where I can quickly write queries on
my own and use those in the code. This is especially easy since most of the
queries can be written as postgres functions, so the strings of those queries
are incredibly short!

What I've learned from this project I'm doing is that sql is very, very
powerful in what it does. I've shifted more to using it for actions than I
would have in the past, and pretty much all thinking I do is making as little
code as possible.

Anyone make it through reading this and have comments about what I'm doing and
what they like to do?

~~~
msluyter
It sounds like you're calculating a running average. Have you looked into
window functions? I think they do exactly what you need. Something roughly
like:

    
    
      SELECT AVG(stats.val)
             OVER (ORDER BY stats.date
                   ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS avg_val
      FROM stats
    
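To match the "past X days not including the current day" requirement, the frame can end at `1 PRECEDING` instead of `CURRENT ROW`. Here's a runnable sketch using Python's sqlite3 (the table and column names are made up; SQLite has had window functions since 3.25, and the same SQL works in Postgres):

```python
import sqlite3

# In-memory toy table standing in for the real stats table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (date TEXT, val REAL)")
conn.executemany(
    "INSERT INTO stats VALUES (?, ?)",
    [(f"2019-10-0{d}", float(d)) for d in range(1, 8)],  # vals 1.0 .. 7.0
)

# Ending the frame at 1 PRECEDING excludes the current day's value,
# so each row's projection is the average of (up to) the prior 5 days.
rows = conn.execute("""
    SELECT date,
           AVG(val) OVER (ORDER BY date
                          ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING) AS proj
    FROM stats
    ORDER BY date
""").fetchall()
for date, proj in rows:
    print(date, proj)  # the first row has no preceding days, so proj is NULL/None
```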

Regarding this point:

 _What I've learned from this project I'm doing is that sql is very, very
powerful in what it does. I've shifted more to using it for actions than I
would have in the past, and pretty much all thinking I do is making as little
code as possible._

It's an interesting tradeoff. I've been on the other side, in an environment
that was super SQL heavy (also Oracle PL/SQL) where we had to deal with these
huge complex queries, and it was incredibly painful, for various reasons. One,
SQL by itself isn't that composable in the way code is[1], so there tends to
be a lot of copy/paste/duplication. Two, giant queries can be hard to
understand/debug. Three, SQL can be hard to test. I mean, you can set up
integration tests against a database (which can be time consuming), but how do
you unit test a query?

On the other extreme, I've seen cases where something separately queries two
tables and does a join operation in code, which would be much easier/safer
just by doing a sql join. So there's a tradeoff. There's a hard to define
sweet spot somewhere in the middle, IMHO, where you take advantage of
relational semantics/joins, and utilize various sql features like group bys
and whatnot, but where you haven't shoved _all_ of your logic into the
database layer.

[1] You can create CTEs (with foo as....) in postgres, but they may have
downsides wrt performance (at least, last time I'd read up on it they posed an
optimization boundary). You can also create views, but similar issues may
arise.

~~~
cldellow
Re the CTE optimization boundary comment: this changed in Postgres 12! By
default, they are no longer a boundary. You can change the behaviour on a per-
query level, too. As always, depesz has an article going into the details:
[https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...](https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-allow-user-control-of-cte-materialization-and-change-the-default-behavior/)
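For reference, the per-query control in Postgres 12+ looks roughly like this (a sketch only: `stats` and `val` are made-up names, and the strings aren't executed against a server here since `MATERIALIZED` is Postgres-specific syntax):

```python
# The 12+ default: the planner is allowed to inline the CTE into the
# outer query, so it is no longer an optimization fence.
inlined = """
WITH recent AS NOT MATERIALIZED (
    SELECT * FROM stats WHERE val > 0
)
SELECT AVG(val) FROM recent;
"""

# Force the pre-12 behaviour: the CTE is computed once and acts as a fence.
fenced = """
WITH recent AS MATERIALIZED (
    SELECT * FROM stats WHERE val > 0
)
SELECT AVG(val) FROM recent;
"""
```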

------
QuadrupleA
I might be under-informed but "data science" seems to involve a lot of vague
BS. Computing has always centered on data.

"Data practitioners", "data practice", "data management" just seems like weird
rebranding of stuff that businesses have been doing since the 1960s. Partly
what the article seems to be relating regarding SQL.

What is "data science" besides computing, data storage of some sort, and
analyzing the data? Because it's "BIG" now? Because now we "extract knowledge
and insights" from data, since apparently giant databases were amassed for no
reason in the past? Because now we "combine multidisciplinary fields like
statistics", since apparently nobody had thought to run statistics on their
data before? Because "AI"?

~~~
ramraj07
A good data scientist is a jack of all trades: a mediocre programmer, mediocre
ML modeler, mediocre ETL architect, mediocre statistician, and a mediocre
analyst. You hire this person because you don't need a specialist in each of
these fields, but you do need someone who can do all these things. Also, you
don't want to pay a metric ton of money, so you can't expect some superstar.

~~~
EForEndeavour
> Also you don't want to pay a metric ton of money

Doesn't the job title of "data scientist" still correlate with high pay, or
has the hype started to wane?

~~~
ramraj07
A data scientist doesn't command the same premium as a machine learning
engineer or a kubernetes expert, I'm afraid.

------
patgrdj
RDBMS can be replaced by Impala, Spark SQL, Drill, Presto, or any SQL engine
on top of Hadoop.

------
hypfer
Hey you've scraped my email from GitHub and sent me this link in some kind of
newsletter

I don't think this is GDPR compliant.

------
dsstudent
Are we going to get feedback?

------
_pmf_
The damage SQL has done to the general public's perception of the relational
model cannot be undone. The relational model is beautiful; it's in no way more
complex than objects and attributes, but SQL makes it seem so by conflating
orthogonal aspects into it.

~~~
ergothus
Care to elaborate? What you've said resonates, but I lack non-sql relational
understanding with which to really evaluate.

~~~
XuMiao
Relational algebra never evolved into a relational programming language; SQL
is merely a query language. Recursion is horribly done and there is no
type system. Imagine this: Friends(a: Person, b: Person) defines the
relationship and the foreign keys at the same time. It makes reasoning
easier, too: mary.Friends.Friends gets all friends of friends of mary, each of
whom is a Person. SQL requires you to write a lot of joins to achieve this,
which makes for an error-prone coding experience. That is why there are ORMs,
which end up as half-baked solutions.
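To make the comparison concrete, here's roughly what that friends-of-friends lookup takes in plain SQL, sketched with Python's sqlite3 (the schema and names are made up for illustration):

```python
import sqlite3

# Toy friends table: each row says "a is friends with b".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (a TEXT, b TEXT)")
conn.executemany("INSERT INTO friends VALUES (?, ?)", [
    ("mary", "bob"),
    ("mary", "carol"),
    ("bob", "dave"),
    ("carol", "erin"),
])

# What a hypothetical mary.Friends.Friends would express in one chained
# accessor takes a self-join in SQL:
rows = conn.execute("""
    SELECT DISTINCT f2.b
    FROM friends f1
    JOIN friends f2 ON f2.a = f1.b
    WHERE f1.a = 'mary'
""").fetchall()
print(sorted(r[0] for r in rows))  # mary's friends-of-friends
```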

In fact, logic programming languages and SQL should consolidate into a
relational programming language. Everything we write as a program would
automatically support persistent and distributed storage. It could also support
probabilistic computation to bring machine learning in. Then we would have
a complete data-driven software solution.

Unfortunately, right now, we cook everything up with SQL, python, operational
DBs, analytics DBs, Spark, Tensorflow.

It could have been a better place.

------
innagadadavida
In the Hadoop world things have evolved to support SQL: Spark, Hive, and Impala
all have full support for it. The Spark implementation is actually faster
than doing low-level RDD processing yourself, as there are optimizers that work
very well. In addition, you can create UDFs that are easy to integrate.

The only reason you might choose other approaches is to make the problem look
significantly more complicated, thereby justifying more maintenance and
resources.

SQL is just too easy, and some super smart engineers don't like it because of
that. That said, NoSQL does have value in some corner cases where it can
perform better, when most of the logic is simple lookups.

------
noobiemcfoob
“Why didn’t we use an RDBMS in the first place? “

Because initial application specifications are sparse and definitely wrong. If
your application is still up and running 5 years later and your data
definition hasn't changed much in the past 3, then maybe refactor around an
RDBMS. Designing around rigid structures during your first pass is costly.
This is why there's been a rise in NoSQL and dynamic languages.

~~~
asdfman123
I mean, we're all programmers here, and we've all dealt with the growing pains
of changing requirements.

But has an RDBMS _ever_ been a major source of that pain? I can't say I've
ever encountered a time when it has. If you need to change the structure, just
write a script.

I'm not saying it's completely painless, but nothing is.

~~~
shantly
IMO if your data model's still so nebulous that you're changing it so often
that SQL migrations are a serious impediment to progress, you probably don't
need _any_ real datastore yet. You can usually figure out WTF you're going to
do, broadly speaking, before persisting anything to a remote database. And if
you can, you very much should.

Yes, yes, there are sometimes exceptions, one must repeat explicitly despite
having already said it (see: "probably") because this is HN.

~~~
asdfman123
Can someone explain to me any circumstances where having no well defined data
model is better than coming up with a clear relational model? Honest question.
It seems like it would just be a huge hassle trying to deal with your
dissimilar data.

When I'm designing an app from scratch I often think about the SQL tables
first and how I'm going to build them, and it really sharpens my idea of what
my program will be.

I don't see how skipping that process would make things easier.

~~~
Scarbutt
You can't really design your tables well enough without knowing your UI; it's
a back-and-forth process.

~~~
asdfman123
And it's something you can do entirely on paper, too, before you go to code.

However I've never greenfielded an actually large application.

