The Rise of SQL-Based Data Modeling (holistics.io)
192 points by huy on Jan 30, 2020 | 125 comments



> This implies that SQL is not reusable, causing similar code with slightly different logic to be repeated all over the place. For example, one cannot easily write a SQL ‘library’ for accounting purposes and distribute it to the rest of your team for reuse whenever accounting-related analysis is required.

Data scientist here. I think some of this problem is handled by effective use of views. Oh, everybody is constantly joining these three accounting-related tables and aggregating by, say, order number? Have your Data Engineer/DBA/analyst/whoever create a view that takes care of that. Boom. Now everybody's using the same data, calculated the same way, nobody's reinventing the wheel, and you don't have to worry about somebody fat-fingering something when they re-write that query for the 10th time.
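A minimal sketch of that idea against an in-memory SQLite database (all table and column names here are invented, not taken from any real accounting schema):

```python
import sqlite3

# Hypothetical schema for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders   (order_id INTEGER, customer TEXT);
    CREATE TABLE payments (order_id INTEGER, amount REAL);
    INSERT INTO orders   VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO payments VALUES (1, 10.0), (1, 5.0), (2, 7.5);

    -- The shared join+aggregation lives in exactly one place: a view.
    CREATE VIEW order_totals AS
    SELECT o.order_id, o.customer, SUM(p.amount) AS total
    FROM orders o JOIN payments p ON p.order_id = o.order_id
    GROUP BY o.order_id, o.customer;
""")

# Every analyst queries the same, identically computed numbers.
rows = con.execute(
    "SELECT customer, total FROM order_totals ORDER BY customer"
).fetchall()
print(rows)  # [('acme', 15.0), ('globex', 7.5)]
```

Change the view's definition once and every downstream query picks up the fix.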

With that being said, I still think there's some truth to this criticism, in that it's not as easy/common to be able to build an abstract query that does a common operation on arbitrary data. You can't import trend_forecast.sql, hand it arbitrary time-series data, and generate an N-month linear forecast from your historical data points. At least, not easily in ANSI SQL.


imo any SQL that requires regression (I have implemented least squares in T-SQL with recursive CTEs) is probably a bad use case for SQL in general. Projecting, filtering, grouping, and sorting are what SQL engines are great at; repeated row access, or referencing the previous row from the next row and then running a function, is not going to be fast in most databases.

To the point on views - they can seem really useful, but many people don't understand that views don't eliminate "useless code": most engines are going to evaluate everything in the view, and the nested subviews, and of course the 10-deep views that were created because they are such a great abstraction.

This is where (to me) other, non-database programming languages definitely come in - you can codify codegen or other methods that are locked in, not just a huge string, and get a lot of the same value as your views (unless, of course, everyone is querying the db directly). If that's the case, then templating out code is still a good plan, but you can usually accomplish it via the client tooling people are using (most of the DB IDEs have pretty decent snippet support).


Agreed on regression. You should probably be doing that level of analytical work in a stats package. Perhaps I should have chosen a better example to illustrate the "SQL doesn't facilitate creating abstract libraries that operate on arbitrary data" point.


I've found views wildly useful for providing a single definition of a complex hierarchy defined by business rules, especially for reporting in 'pre-packaged' software with a fixed table structure. Often in these generic CRM or ERP systems it's surprisingly hard to actually get at the root cardinality-1 data and its 1-n categories.

My strategy for these systems is to first create a view that captures the complete business's desired cardinality (which may be surprisingly complicated), and then all the reporting views and stored procedures start with that view as the first table to join off of. Ideally you can implement it with left joins in such a way that you can use it everywhere you need any part of the hierarchy, and the SQL engine will trim the parts you don't use in each particular query.

With this you get the abstraction/reusability that enhances productivity, and you get more reliable and reproducible results everywhere. It's much more likely that cross-checking between separate reports works when everything starts with the same cardinality.
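A hedged sketch of that base-view strategy in SQLite (schema invented; a real ERP hierarchy would be much hairier). The left joins keep the view at one row per root entity even when category data is missing:

```python
import sqlite3

# Invented schema standing in for a generic CRM/ERP layout.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE regions   (region_id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE accounts  (customer_id INTEGER, region_id INTEGER);
    INSERT INTO customers VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO regions   VALUES (10, 'EMEA');
    INSERT INTO accounts  VALUES (1, 10);

    -- One row per customer regardless of missing category data:
    -- every report starts its joins from this view.
    CREATE VIEW customer_base AS
    SELECT c.customer_id, c.name, r.label AS region
    FROM customers c
    LEFT JOIN accounts a ON a.customer_id = c.customer_id
    LEFT JOIN regions  r ON r.region_id  = a.region_id;
""")

rows = con.execute(
    "SELECT name, region FROM customer_base ORDER BY name"
).fetchall()
print(rows)  # [('acme', 'EMEA'), ('globex', None)]
```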


Are you saying you build a completely de-normalized View of data? Could you please provide a small example?


My problem with views is that you do the join before the filtering.


Any query optimizer worth its salt will push the filters below the join.


I'm curious, what DBMS are you using whose optimizer can't see through views?

For what it's worth, MSSQL, PostgreSQL and Oracle don't have that limitation.


BigQuery.

If I make a view that joins tables a and b, and I query that view with a filter, BigQuery won't push the filtering down onto a and b and then join.


Ah. That's not too surprising, then, is it? BigQuery is column-oriented, so I imagine efficient row-selective queries aren't really what it's for in the first place.


Predicate pushdown is one of the first optimizations implemented in SQL planners and all the OLAP and OLTP solutions that I'm aware of already have this feature in place.


One of the things I'm constantly baffled by given the growth of data science and analytics is that data modelling isn't treated as a first class concern. It's absolutely fundamental to doing quantitative work efficiently at an organization with any amount of complexity, yet the majority of people in the field seem to be unaware of the concepts.

This ignorance is especially surprising given that it's essentially a solved problem (Kimball), yet if you talk about data modelling, people usually think regression, not schema.


I agree that data modeling is underestimated, but it can hardly be considered a solved problem. It is very hard because there are numerous alternative understandings and formal definitions of what we mean by data (RM, OO, OR, MD, etc.). In addition, there are several levels of representation (physical, logical, semantic). In real projects, they all get mixed.


Could you elaborate on why you think data modeling is a solved problem? It seems like an opinion warranting more than a one-word explanation.


I'm guessing that the implication was that Kimball's dimensional modeling techniques (so, in a nutshell, snowflake and star schemata) will get the job done in almost any case.

I'm not sure I would 100% agree with that - e.g., denormalization, while useful for many things, isn't always your best option. But I would say that there are a lot of tools in the box, and that is absolutely one of the critical ones to know.


> denormalization, while useful for many things, isn't always your best option.

From my view, it is generally not a good option for cases it wasn't designed for, an example being non-analytical reporting. If you are running operational support, getting the source data immediately and aggregating/displaying it can be more helpful than modeling for analytics workloads. The line between these is blurred in most orgs. To the OP's point, data modeling seems like a sidenote in most analytical discussions. You can accomplish a lot using the star model, which is essentially:

Prepare things to be fast by sorting them into proper groups (fact/dim/bridge)

Rely on ints

Store atomic data

Provide summaries/aggregates

Model after the questions you ask; not the system it comes from
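A toy version of those principles, sketched in SQLite with invented names: integer surrogate keys, atomic fact rows, and dimensions shaped around the questions asked rather than the source system:

```python
import sqlite3

# Hypothetical star schema: one fact table, two dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_id INTEGER, product_id INTEGER, amount REAL);
    INSERT INTO dim_date    VALUES (1, '2020-01'), (2, '2020-02');
    INSERT INTO dim_product VALUES (1, 'widgets');
    INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 1, 50.0), (2, 1, 25.0);
""")

# A summary/aggregate computed over the atomic facts:
rows = con.execute("""
    SELECT d.month, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON d.date_id    = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY d.month, p.category
    ORDER BY d.month
""").fetchall()
print(rows)  # [('2020-01', 'widgets', 150.0), ('2020-02', 'widgets', 25.0)]
```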


I think Kimball's approach stumbles a bit because it needs bitemporalism to rescue it from its own temporal hackery.

But for a lot of things it works pretty well at what it does. It's useful to have a body of work formed over many years, tested in many demanding scenarios, providing some sort of structured guidance. Twitter and blog posts and random tinkering don't rise to the same level.


" Up till a few years ago, the traditional way of managing data (in SQL-based RDBMSs) was considered a relic of the past, as these systems couldn't scale to cope with such a huge amount of data. "

I must have missed the boat on this one. I remember in 2010 there was a brief period of time where NoSQL was in fashion, but it rightfully died pretty quickly to a small set of specialized use cases. There have been some cases where big data systems have replaced more traditional rdbms systems, but now you can use SQL for those too (like Hive SQL).

SQL is the one skill that has not become obsolete in the course of my career. Frankly, I've started relying on it more, because it never goes obsolete. Also, it's fast as hell. When I first started my career I did C#, and .NET Framework 2 was fairly new. Since then, WinForms and WebForms have gone away. ORMs changed, JavaScript changed. Then I moved to Ruby, and Python, and PHP. Those ecosystems have evolved too. But the one thing that I learned 15 years ago, and that I still use every day, is SQL.


> I must have missed the boat on this one. I remember in 2010 there was a brief period of time where NoSQL was in fashion, but it rightfully died pretty quickly to a small set of specialized use cases. There have been some cases where big data systems have replaced more traditional rdbms systems, but now you can use SQL for those too (like Hive SQL).

A few weeks ago I had a lead server-side architect call me up and ask if my client could tolerate eventual consistency. My answer was that "he could get away with that 10 years ago but that NoSQL went out of fashion and SQL came back in again."

To put it in a different context: Some people just go through periods in their life where they don't keep up and don't realize what was "new and exciting" 10 years ago turned out to be smoke and mirrors.


What's the connection between NoSQL vs SQL and eventual consistency?


NoSQL is a bad name that really doesn't have anything to do with SQL per se, but is a shorthand for non-relational datastores (no one calls non-SQL relational systems “NoSQL”), which, while they depart from the relational model in a number of different ways, typically include eventual rather than strong consistency to support liveness in a distributed context.


NoSQL in some cases trades off consistency for scalability (generally by sharding data, so propagating data across shards introduces eventual consistency).

As you can imagine, data that reside in a single shard don’t need to be propagated, so you can get strong consistency easily (e.g. fetching a single document). It’s when you start doing queries, joins, aggregations, etc. that you start encountering eventual consistency.


Mainly that most people who ask that question are using it as a lead-in to promoting a NoSQL product.


NoSQL is very much alive and kicking, it’s just past the peak in the hype curve. There is a FAANG in particular who is notorious for banning relational databases outright because of their maintenance and scaling problems for their use cases.

As others have said, if you operate at a certain scale and can get away without complex joins and aggregations (which you should avoid anyway if you’re operating a high availability service), then NoSQL is a great option. It’s easier to scale out (often you just turn a knob) and its behavior is simpler and easier to predict.


The phrase that comes out when someone proposes NoSQL is "we're not Google / Facebook."

Most applications don't need to run at "Google scale," thus they don't need the expense of relaxing ACID.


Of course, consider your use cases. There are good reasons to use NoSQL, and excellent reasons to use RDBMS. I use both professionally and appreciate the benefits of each.


What FAANG _bans_ relational DBs?


The nugget of truth was that joins and transactions are very difficult to implement in a performant manner in large scale, distributed systems. So it was possible to write very highly scalable databases if you give up on "SQL", which was a stand in for joins and transactions.

Since then, developers have been coming up with clever ways to implement distributed joins and transactions more efficiently and scalably on top of distributed database engines, and thus "the return of SQL".


Where can I learn more about these clever developments?


I find CockroachDB one of the most interesting systems adding distributed transactions, joins and SQL support on top of a highly scalable key value store engine:

https://www.cockroachlabs.com/docs/stable/architecture/overv...



The youngsters at my work don't really know SQL.

I am probably seen as one of those old developers who refuses to get with the times (rather than as someone who sees NoSQL for the step backwards that it is). It's this constant churn of bullshit that makes me want to do a different job after nearly 20 years (as well as the nonsense that is the hiring process).


This, I think, is the true cost of technologies like ORM. And it's hard to notice, because it doesn't show up right away; it's a generational thing.

It reminds me a bit of a paper I saw recently on The Morning Paper called Ironies of Automation: https://blog.acolyer.org/2020/01/08/ironies-of-automation/


Unfortunately for this data analyst NoSQL is alive and well. It is being rolled out to enterprise levels without the downstream impacts fully realized. This requires all sorts of ad-hoc ETL data methods which in turn puts more time and strain on budgets and resources. At least this is true in my organization.


I think there is much more non-SQL out there than the OP realizes. I have been protesting non-SQL; SQL is more powerful than most realize. Also, the term "Big Data" is used 99% of the time about databases I could load into my RAM, with rows counted in the millions. This then causes people to search for things and find non-SQL.


That said, every big data tool I've encountered in current usage supports SQL - very few people writing Map/Reduce jobs these days. SQL has become the lingua franca of data analysis.


> brief period of time where NoSQL was in fashion, but it rightfully died pretty quickly

That which is not standardized dies.


I totally agree with you. And SQL keeps evolving.


So long as fresh faced developers arrive on the scene, there will always be people looking to get around using SQL.


And databases could scale up even then: Pan-STARRS is one of the bigger databases, at petabyte scale.


> Also it's fast as hell

compared to what? excel?


This quote irks me: "Any tool that relies heavily on SQL is off-limits to business users who want more than viewing static charts and dashboards but don't know the language"

The good business folks know SQL and aren't afraid of it. I wasn't sure of this until I worked in an organization where most PM types use SQL comfortably.


Yea, I'm not sure what the alternative would be. You can do some drag-and-drop dashboard creation or export small data sets to Excel, but at some point you've got to get to SQL if you want all of the expressiveness.


"Knows SQL" is kind of hard to pin down. I also have business people who know SQL, yet don't use (or request, when they learn there aren't any) indexes. So we get this amalgamation of monstrous subqueries all hitting the same unoptimized fields and, of course, blocking the database.

I created a mini-training for the onboarding process, so they could at least look for things without killing all our read replicas. But that’s just one step; some people are just not good at it.

Fear... fear is having everyone blocked because one person is killing all the replicas to get their own work done.


That sounds more like an ops problem than a business one. I've worked with a number of non-technical roles that dabble in SQL, and almost always they are querying a data warehousing solution (e.g. Redshift, Pig, BigQuery) that doesn't support indexes to begin with, and most of the time they just have to live with the performance.

It’s not their job to optimize the database, and it’s no different from seeing that a dev pushed a new feature that causes bad performance as well


It’s not their job to optimize the database, but it sure is to recognize the basic bottlenecks in their scripts. My training addresses things they have to take into account, and how to approach us when they think they need optimization to what they are doing. It’s just that some of them don’t do it. I made it so it’s part of their onboarding process, so they should know who to contact for these things, yet some of them start literally complaining instead of contacting the team that could help them out.


The solution to this problem should be setting up resource quotas.

That person should reach their quota limit and then ask you to help optimize the query.


I do something similar nowadays, and I've established a process for them to consult us whenever these issues occur; it's just that some don't do it, because they don't care about anything. Of course, I've created processes that make it easier for my team to deal with people like that, but it doesn't take away from them not knowing SQL.


Don't really get this article. SQL never went away or got less popular. Always been in the background doing its awesome thing.


In data science it has been an uphill battle between the people who think only CSV exists and those who switch to a new NoSQL flavor every 6 months.


I think it’s artificially presented as an exclusive option between sql and csv/nosql/whatever in data science.

I know plenty of people who are comfortable in both. But I don’t know any “data science” people as good at sql as a good dba. But that’s ok.

I think a good analysis requires both SQL and some flat file, if only for reproducibility. I've encountered many point-in-time analyses that can't be recreated because they were just SQL against the db and there's no way to reproduce the result months later. Are their methods accurate? Auditable? It's much harder to tell because the source data has changed.

So I like talking to only-SQLers about the importance of incorporating the data management of extracts and archives into the overall solution, and doing so in a way that doesn't try to redesign the database to avoid the need for users to file away point-in-time snapshots of data.


CSV/Excel has run the world since the beginning, that's going to be a never ending battle. Large orgs are still entrenched into SQL when their enterprise ERP/MRP systems run off Oracle, and their analysts are extracting reports to Excel.


Right after flat files which are still in use on mainframes.


I agree. NoSQL is still working its way along the hype curve, but a huge amount of business data still lives in relational databases. The fact that there's money to be made supporting this market is not big news.


I think the big contender has been dataframe processing using Pandas, R and others. These have their own methods for doing the type of data munging and analysis that has been the traditional territory of SQL. I won’t soon forget the half day I plowed into getting the right Pandas pipeline dialed in only to realize I could have written a better SQL query to fetch the correct data to begin with.


I’m sure there will be an xkcd about this soon.

I’ve also tried to do things in sql only to realize that I could have done it easier in pandas.

I think it’s usually a mix of both for me, but I try to at least get a decent “tidy” dataset out of sql and then do all the manipulation in pandas because a python pipeline is more portable than a sql pipeline, I think.
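That split might look something like this, sketched with invented tables (and assuming pandas is available): SQL hands over a tidy frame, and pandas does the downstream reshaping:

```python
import sqlite3

import pandas as pd

# Hypothetical events table, for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (user_id INTEGER, kind TEXT, value REAL);
    INSERT INTO events VALUES (1, 'click', 1.0), (1, 'buy', 9.5), (2, 'click', 1.0);
""")

# SQL does the projection/filtering it is good at...
tidy = pd.read_sql("SELECT user_id, kind, value FROM events", con)

# ...and pandas does the reshaping, which travels well in a Python pipeline.
pivot = tidy.pivot_table(index="user_id", columns="kind", values="value",
                         aggfunc="sum", fill_value=0)
print(pivot)
```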


Google Trends confirms, even MongoDB is only a fraction of SQL.


Thank god. I am working with it now and it's crap in comparison.

I wrote a fairly complex Mongo query a while back, and trying to understand it now is a nightmare. Compare that to SQL, which is one of the most readable languages out there, in my opinion.


To me, Mongo is like Redis: it has its place in the pipeline, but it is not a proper tool to replace an RDBMS, nor should anyone pretend it is.


I don't really understand what this article is trying to say, other than "isn't Looker great!".

Having timely and detailed data available to a wide range of people in the organisation is now seen as a competitive advantage in many industries. There's a lot of tech out there to help with this.

But I could have written this article talking about how companies like Reltio are relying on NoSQL solutions to "empower the enterprise", or how Firebase is allowing startups to not worry about data structures, or how HSBC is deploying blockchain solutions, or how Spark means you can combine data from all different sources. It would still be just as accurate and meaningful. As it is, it just sounds like an infomercial for Looker.


The weirdest part to me is that they went from 'Everyone is back to using SQL because it's good at querying and you don't need to learn a proprietary language' to 'Let's use LookML instead'.


Correct, but the problem with raw SQL is that it's not easily composable. LookML essentially provides that composability on top of SQL.


I don't agree - I find SQL easy to compose using CTEs and functions. Postgres 12 is amazing for this because chained CTEs are now just as fast as traditional nested joins (which can be hard to decipher). Also, lateral joins are great for composing set-returning functions - Postgres joins them automatically, like magic.
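The chained-CTE shape being described, run here against SQLite just for illustration (the same structure works in Postgres; table names are invented):

```python
import sqlite3

# Each CTE names the previous one, keeping a long transformation readable.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("eu", 10.0), ("eu", 20.0), ("us", 5.0)])

rows = con.execute("""
    WITH per_region AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    ),
    big_regions AS (
        SELECT region, total FROM per_region WHERE total > 6
    )
    SELECT * FROM big_regions ORDER BY total DESC
""").fetchall()
print(rows)  # [('eu', 30.0)]
```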


CTEs solve the problem only to some extent, and only for the data sources. You often need snippets, expression functions, the ability to compose multiple expressions, and to compile SQL queries that are both efficient and readable if you want to cooperate with different data analysts. It's not that SQL is limited; it works the way it was designed.

Take this example: https://github.com/rakam-io/recipes/tree/master/segment/ware... It makes use of CTEs and lateral joins too, but it also needs a templating engine (Jinja) for data models and Jsonnet for analysis models, similar to LookML, in order to provide that composability.


Agreed. LookML is to SQL what Sass is to CSS: not necessary, but it sure makes certain things easier.

I’m not connected with Looker in any way, and they are pretty expensive, but the product is awesome.


I agree with you, Looker is awesome but it's far from perfect. It solves most of the problems in a convenient way, but as the data modeling gets more and more complex they're not going to be able to iterate the product at that pace; it's the destiny of almost all enterprise products.

Product analytics is one of these complex use-cases. You can't really find a convenient way to run funnel and retention queries in Looker; it only focuses on slicing/dicing the data with their models. Moreover, it has a high learning curve even for the enterprise, and there are definitely better ways to approach this problem.

Disclaimer: We're actually working on a LookML alternative based on Jsonnet that will be a better fit for event modeling. This is what it looks like: https://github.com/rakam-io/recipes. We recently published a VSCode extension for it: https://marketplace.visualstudio.com/items?itemName=rakam.ra...


I find LookML extremely useful, but if you have something built for event analytics I can't wait to try it. Looking at the docs, it's not clear to me how to get started, though. Maybe an introductory tutorial on an open-source stack and a freely available dataset would help a little with that. Just a suggestion.


It's a blogvertisement by Holistics, which is a competitor to Looker.


Oh, I've never heard of them, thanks for the heads up. Explains why it reads like an advert - because it is.


These days blogs hosted on company sites are almost invariably marketing of some flavour. Even if they're tech heavy, it's marketing to e.g. improve hiring.


DBT (data build tool) [0] embraces this idea - it's like make for data transformation. Just like make, its syntax is sub-optimal, but that's the only drawback. There is an open source version, it generates documentation automatically, you can materialize tables on a schedule, and it allows you to write unit tests for your data ("this select must return 0 rows" kind of tests). I'm not affiliated with them, just a happy user.

[0] https://www.getdbt.com/


Yes, that was my first thought when I opened the article. dbt is amazing and basically converts your SELECT statements to CREATE TABLE statements.

I guess they also target the EL from ELT, while dbt relies on the data already being in your data warehouse.
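Mechanically, the "SELECT becomes CREATE TABLE" move is just CREATE TABLE ... AS. This sketch (invented schema, not dbt's actual code; dbt layers templating, dependency ordering, docs, and tests on top) shows the core idea:

```python
import sqlite3

# Hypothetical raw table in the warehouse.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw_orders (id INTEGER, status TEXT);
    INSERT INTO raw_orders VALUES (1, 'paid'), (2, 'refunded'), (3, 'paid');
""")

# The analyst only ever writes the SELECT...
model_sql = "SELECT id FROM raw_orders WHERE status = 'paid'"

# ...and the tool materializes it as a table.
con.execute(f"CREATE TABLE paid_orders AS {model_sql}")

rows = con.execute("SELECT id FROM paid_orders ORDER BY id").fetchall()
print(rows)  # [(1,), (3,)]
```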


The problem with modeling data isn't the lack of tools or the need to approach this problem differently. Plenty of solutions exist. People just don't care, and this is ok.

It used to be that you needed tools like Informatica and Kimball inspired datamarts, but databases are now bigger and faster. Whatever data modeling problems you may have, you can easily clean up in an ETL or a BI layer, with relatively little effort. This makes tools like Looker, dbt, and Holistics a luxury and not something you need to have. I wish the industry would put more effort into defining clean data models, but I think that ship has sailed. The prevailing trend seems to be to create Data Lakes, add a BI layer, then call it a day.

[edit] Also...some database points. The industry never shifted to using NoSQL to replace RDBMS systems. But event processing matured, NoSQL db's are ideal for storing unstructured data, so you see them in data engineering stacks. Greenplum is a free MPP database that has been around for nearly a decade. The point about Spanner SQL is interesting for the fact that Spanner evolved from NoSQL like methods to a SQL like dialect, but Spanner is a unique flower in the industry, due to being an HTAP db.


Can you elaborate more on what you mean by “creating clean data models”?


Using Inmon or Kimball's approach to modeling data structures is clean. The thing with modeling data is there's no one right way but there are many wrong ways.


The article doesn't mention performance.

I haven't benchmarked this yet, but after my first experiences with somewhat complex data transformations in numpy and pandas, I was left with the feeling that despite them being optimized, any modern RDBMS would still have run circles around them.

They've been optimized to this kind of stuff for decades, after all.


Most modern RDBMS are optimized for point-lookup queries and have a very hard time choosing good plans when you're selecting millions of rows amongst billions.

Remove the assumption that the end result is going to be a row or ten (which usually have good obvious plans with clear indexes), and the query planner is forced to make decisions about choosing the best plan amongst very costly alternatives. The exact costs are sensitive to assumptions about distribution, I/O costs, memory costs etc. and those assumptions need delicate tuning, especially when you don't have hinting as a tool (e.g. Postgres basically doesn't let you give many hints - you can kind of force join order, but the biggest tool is the CTE optimization barrier - I can get 100x speedups by moving subqueries out into CTEs and ensuring they're opaque).

To write efficient big queries, you need to think in terms of data flow and write declarative SQL with an execution plan in mind; it reminds me of my days tweaking source code to encourage a compiler to make certain register allocation decisions.

PG is a bitch here. It's often easier to get good performance by fetching one query, transforming it into a comma-delimited string, and then injecting it into the subsequent query (possibly in batches) to be sure you're getting the plan you need.
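The two-step trick being described, sketched in SQLite with invented tables. Worth stressing that splicing strings into SQL is only safe for trusted values you have coerced yourself (here via int()), never user input:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER, active INTEGER);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO orders VALUES (1, 10.0), (2, 99.0), (3, 5.0);
""")

# Step 1: fetch the driving set with its own query.
ids = [row[0] for row in con.execute("SELECT id FROM users WHERE active = 1")]

# Step 2: splice the values in as literals, so the planner sees constants.
id_list = ",".join(str(int(i)) for i in ids)  # int() guards the splice
rows = con.execute(
    f"SELECT user_id, amount FROM orders "
    f"WHERE user_id IN ({id_list}) ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 10.0), (3, 5.0)]
```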

MySQL's planner is kind of stupid, but it's predictably stupid and there's lots of hinting tools available. I find it a lot easier to make it work fast for big queries. Sometimes it simply doesn't implement the best strategy (e.g. no hash joins), of course, so sometimes it's not as fast as you could get with PG, but it's usually easier to outperform.


> To write efficient big queries, you need to think in terms of data flow and write declarative SQL with an execution plan in mind; it reminds me of my days tweaking source code to encourage a compiler to make certain register allocation decisions.

That mirrors my experience with Oracle. The only difference is that there's a hinting mechanism provided which will almost always allow you to force the execution plan you want.


Ugh hints. The things they almost always recommend you don’t use because the optimizer should know better. Unsarcastically, this is usually true. More often I see people who think they’re being clever killing performance with undocumented hints instead of either letting the engine do its work or looking for other ways to optimize.


>Most modern RDBMS are optimized for point-lookup queries and have a very hard time choosing good plans when you're selecting millions of rows amongst billions.

Any column store warehouse (Redshift, Bigquery, Snowflake) is terrific for these kinds of queries, though.


> I haven't benchmarked this yet

if you don't measure, you're just guessing.

> after my first experiences with somewhat complex data transformations in numpy and pandas

there are fast ways and slow ways to use numpy/pandas, but generally speaking, it's easy to get order of magnitude improvements using pandas compared to an RDBMS.

In pandas, for example, the data from a column is a contiguous in-memory array. Generally RDBMS data is row-oriented, possibly in memory or possibly on disk. Performing some numerical operation on the contiguous array is going to play to the cpu's strengths much better.
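A small illustration of the memory-layout point (assumes NumPy is available; a numeric pandas column is backed by exactly this kind of array):

```python
import numpy as np

# One flat, contiguous block of float64s -- the columnar layout.
col = np.arange(1_000_000, dtype=np.float64)
assert col.flags["C_CONTIGUOUS"]

# A whole-column operation is a single vectorized pass over adjacent
# memory, rather than a pointer-chase through row structures.
total = col.sum()
print(total)  # 499999500000.0
```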


Having been a total stubborn asshole to my boss over the last decade because I refused to even investigate how we could replace our old and "complicated" SQL integration/export workflows with a modern and "intuitive" visual proprietary ETL, this article is a real relief for me!

I'm now officially a bleeding-edge DevOps engineer with 10 years of expertise in the brand new "old school" ELT (Extract, Load, Transform).

LoL


Complicated SQL is definitely complicated, though. Get an expected 1:1 join wrong (so it ends up 1:n) and duplicates muck things up; and typically unit tests don't catch these kinds of errors because the test set is minimal.

Long scripts of SQL with intermediate steps (e.g. building temp tables) are hard to reason about and debug, and especially painful when they fail only in rare production scenarios.

Developers find it a lot easier to reason on an item by item basis, and not on a set or batch basis, and it's very tricky to get error handling and transactions right on a batch basis unless you want to fail the whole batch.

Putting all the work on the database also means you need to scale up that layer. From another POV, it can be simpler to have cheap point lookup, secondary indexes in stores that are tuned for the specifics of the queries, and get performance back by scaling out the compute.


1. If you rely on an ETL to automagically remove duplicates, you are lying to yourself, because you are treating external symptoms: the illness is the underlying query, and hiding this fact can only lead to more headaches later on.

2. Testing for duplicates is a SQL one-liner.

3. If your data model is well conceived, you can actually prevent unwanted 1-n relationships from being inserted in the first place. Yeah, sharded databases came late to the game, but ACID compliance is a killer feature. Code to manage inconsistencies in NoSQL is way less trivial, as far as I know... unless it's OK to have an inconsistent database... I guess it depends on the use case, but ACID seems nice for any important dataset.
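The duplicate-test one-liner, for anyone curious, sketched against SQLite with an invented table: group on the key that should be unique and keep only groups with more than one row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (order_id INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (2,), (3,)])

# The one-liner: any row returned is a duplicated key.
dupes = con.execute(
    "SELECT order_id, COUNT(*) FROM t GROUP BY order_id HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [(2, 2)]
```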


I think you misunderstood me on the duplicates. I'm talking about how easy it is to e.g. left join an attribute from another table and under-qualify the association so that you accidentally pull in unwanted rows. It's not that you have duplicates in the data; the data is correct and the model is correct. It's rather that the query is incorrect and the set of tests, owing to them typically being written by the same person who wrote the code, doesn't populate the association table with more rows than expected.

I mention this because it's a bug I've seen crop up several times in production. When you write 'join' you're typically fetching data to add extra columns to your result set; it's relatively unusual to want a row product in the absence of group by & aggregates, or an ORM doing eager loading of a detail table. But nothing in SQL requires you to state your product expectations on the join, apart from left/right outer vs inner join. I personally think SQL could do with forcing people to be more explicit on this point.
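A sketch of that bug with invented names: the attribute table holds one row per language, the join qualifies only the key, and the SUM silently inflates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    CREATE TABLE labels (order_id INTEGER, lang TEXT, label TEXT);
    INSERT INTO orders VALUES (1, 10.0), (2, 20.0);
    INSERT INTO labels VALUES (1, 'en', 'Paid'), (1, 'fr', 'Payé'), (2, 'en', 'Paid');
""")

# Under-qualified join: order 1 matches two label rows, doubling its amount.
bad = con.execute("""
    SELECT SUM(o.amount) FROM orders o
    LEFT JOIN labels l ON l.order_id = o.order_id
""").fetchone()[0]

# Fully qualified join: exactly one label row per order.
good = con.execute("""
    SELECT SUM(o.amount) FROM orders o
    LEFT JOIN labels l ON l.order_id = o.order_id AND l.lang = 'en'
""").fetchone()[0]

print(bad, good)  # 40.0 30.0
```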

I fully agree on the inconsistencies in NoSQL; I think the most sensible way to use NoSQL is as a document store without relations, and keep the relations somewhere else. Manually maintaining invariants (especially of the denormalized bidirectional kind that some stores require) in an environment with less than complete transaction support isn't my idea of fun.


It's actually worse than that, because the cartesian product is the default mode of SQL. JOIN clauses are already a kind of mitigation that forces you to be more explicit.

With modern SQL, even if you mention no table at all in the FROM clause, you can get "duplicates" if you use a set-returning function (see: https://www.postgresql.org/docs/current/functions-srf.html).

This is both why you should be careful and why it's a really powerful tool for manipulating datasets. But "everything is a cartesian product" is usually one of the first (and, yes, most important) things you learn about SQL.
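A two-line illustration of that default, again via sqlite3: with no join predicate, the comma form of FROM yields the full cartesian product.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (y INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (10), (20);
""")

# No predicate at all: every row of a pairs with every row of b, 3 * 2 = 6 rows.
rows = con.execute("SELECT x, y FROM a, b").fetchall()
print(len(rows))  # 6
```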


Yes, I know, everybody knows. My point is that SQL doesn't help you catch those errors where your join clause doesn't match rows uniquely. You can know that the product is the default and still forget that extra bit in the predicate.


Hah. I guess that happens semi-regularly in our industry: by being stubborn and sticking with old technologies, you may one day wake up and find out it's suddenly bleeding edge again (kind of, in some form, anyway).


Like writing microservices in Cobol on Amazon Cloud. :)

https://www.infoq.com/articles/migrate-mainframe-aws/


Yeah, that is why I long learned to be skeptical in a pop culture driven industry, always looking after the new fad to sell stuff.


Like that it's funny to be a human. When all features have been added its so complicated one might as well start from scratch.... Wait I can put it even worse: What the Father Does, the Son Does - John 5:19-23


It’s a little bit off topic, but the tone of this discussion brings up an interesting conversation I’ve been having with some of my colleagues in the same age group - a perceived skills gap between the newest developers and ourselves.

I don’t mean that in a negative way. I mean it in the sense that many of the newest developers don’t know where their cloud based NoSQL database came from (for example). They never were taught the history of what came before, during and after RDBMS. They are only now rediscovering some of what the “old” tools could do.

Many of these developers seem eager to learn, and I am happy to mentor them and teach them the history that I know.

But it has surprised me, it almost feels a little like so many years of waves of marketing and hype maybe actually had a real detriment to teaching people what is real, what is the best tool for the job in different cases, etc.

I have no real evidence other than my anecdotal experience, but this article lends credence to the argument that some never learned or never were given the time to understand the discipline of databases.

Possibly, the discipline of databases and related development has just been continually developing and never settled, so that is why the curriculum hasn’t kept up. But, it really does concern me when a new developer doesn’t even understand what a JOIN is.

Edit: Or even moreso that SQL is an interface to a data engine, and was not necessarily always tightly coupled to relational databases (although it evolved often in lock step with them which is why you see them there more often).


> Instead, NoSQL systems like HBase, Cassandra, and MongoDB became fashionable as they marketed themselves as being scalable and easier to use

HBase did not - the project has always been very clear that they cater towards a very specific set of use cases - fast writes with little schema constraints, fast single-key and range/fuzzy lookups, not big ETL pipelines.

Even during the rise of Hadoop (everything is a file... I mean file-based!) and the subsequent absorption of that into the public cloud vendors, SQL has always been there, just wrapped in different tools. These days someone else hosts it and it's called Athena instead of Hive, but it's fundamentally the same thing it has always been.

Even Apache Spark's entire Dataset/DataFrame interface yields SQL-like execution plans, exposing the same functions that an RDBMS would, just in Scala/Python/R.


Yes. Plus all the SQL-like things you can do now. For example storing data on S3 and then querying it with AWS Athena is a simple and powerful way to keep a huge archive of data that you may want to query at some point.

Also, never underestimate the power and speed of a well tuned SQL-family server/cluster, even at surprisingly large scale. A lot of use cases for the older “Hadoop Cluster” type stuff have been overtaken by these approaches. I’ve seen a lot of operations spend silly sums to build ultimately quite clunky Hadoop-based systems when really they probably just needed one half decent SQL admin and a well tuned cluster.


BigQuery is good. Looker is OK. This reads like paid-for PR content.


It's a blog page on holistics.io so it is paid-for PR content. Or a blogvert.


I think part of the real shift back to SQL is in part because of the fact that modern streaming platforms like Kafka give relatively simple mechanisms to implement eventually consistent databases. Aside from the simple key-value store use case, this was often a very good reason to use something like Cassandra instead of MySQL.


Great article. SQL is so widely known and relatively simple to implement that I'm baffled businesses don't start with it as a solution and then work their way into other solutions if they find it doesn't meet their needs.

It's so easy and quick to get started, it should be most people's first choice.


A few days ago I designed a DSL for CREATE TABLE statements.

https://gist.github.com/nomsolence/69bc0b5fe1ba943d82cd37fdb...

Being able to focus on the relationships without worrying about commas is nice.

I'm still writing the compiler (it's my first, I'm sure it will be awful), but I'm starting from the finish so it's been easy for me to pick up where I left off; I started by deciding the language, then writing the outputs by hand for the tokenizer, my two stages of AST, and the actual SQL.


Direct image link: https://gist.githubusercontent.com/nomsolence/69bc0b5fe1ba94...

Not shown is the second stage of AST; it resolves types for the SQL dialect, e.g. "bool" to "BOOLEAN" for SQLite (all caps for convention of course), as well as things like the special "ROWID" type for SQLite, and finding types for "FK" columns, adding the constraints declarations, and finally validation!

Maybe the first AST is an IR? I'll read the compiler design book and find out.


Hey, just wanted to chime with a tool I'm a co-author of, OctoSQL[1].

I too really like a common interface to various data sources. OctoSQL lets you use plain SQL to transform, analyse, and join data from various data sources, currently including MySQL, PostgreSQL, Redis, JSON files, CSV, and Excel.

However, we're very inspired by last year's paper "One SQL to rule them all" and should have ergonomic streaming support, with Kafka as a new datasource, available very soon.

[1]: https://github.com/cube2222/octosql


q - Text as Data

q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).

q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs etc. It supports automatic column name and column type detection, and provides full support for multiple encodings.

http://harelba.github.io/q/


Do you plan support for Sqlite?


I see no reason to not add it, but we're currently busy so it'll have to wait till we either have time or somebody contributes it. Should be an hour of copypasting one of the sql data sources and adapting it for sqlite.


Does anyone have any ML Packages that directly integrate with SQL Databases? I feel like the ETL market is well covered with a myriad of tools, but our data scientists at the end of the day still want to extract giant CSVs b/c their python progs "just don't work natively" on an RDBMS (they probably do, but it's not the way it's done, and these guys aren't programmers for their first job anyway).


They exist, they're just not really state-of-the-art.

Apache MADLib[1] for PostgreSQL and Greenplum. Everything is done inside of SQL; for example, to fit a linear regression model, you execute the query "SELECT madlib.linregr_train(...)".

MLlib[2] for Spark now has a DataFrame based API (spark.ml) for easy integration with Spark SQL.

Every such library I've seen has been a little behind the times, usually offering nothing more advanced than logistic regression, random forest, maybe a multi-layer neural network. I've also never seen a SQL-based ML library that offered GPU training.

For reference, state-of-the-art neural network libraries like TensorFlow or PyTorch support GPU-based training (roughly an order of magnitude improvement in training performance) and automatic differentiation (allowing easy specification of complex models). For trees, the state of the art now includes boosted tree algorithms like XGBoost or CatBoost. Random forest is still a workhorse, but a library that only provides random forest (like the two mentioned above) is a little behind the curve.

It seems that libraries that focus on doing one thing well (ML) are able to offer more features and have an easier time keeping up with times than libraries that also take on the burden of SQL integration. Which isn't to say MADlib and Spark MLlib are bad - not everyone needs state-of-the-art algorithms all the time, and it can be convenient to be able to do so simple things fully inside the SQL environment - but they're never going to be cutting edge, so they're not going to get much attention.

[1] https://madlib.apache.org/index.html

[2] http://spark.apache.org/docs/latest/ml-guide.html


SQL Server has had standard ML functions in SSAS for years “SQL Server Data Mining” extensions. In fact, I recall it may have been the first relational database platform to actually do this. It is obviously a commercial DB though.


I believe that more and more providers will offer ML inside the database. You can check what Oracle is doing.

Aside from that, some of the simple models can be fitted using SQL.
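As an example of a simple model fitted in plain SQL: ordinary least squares for one predictor needs nothing beyond standard aggregates, since slope = cov(x, y) / var(x). A sketch with sqlite3 and made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pts (x REAL, y REAL)")
con.executemany("INSERT INTO pts VALUES (?, ?)",
                [(1, 3), (2, 5), (3, 7), (4, 9)])  # exactly y = 2x + 1

# slope = cov(x, y) / var(x); intercept = avg(y) - slope * avg(x)
slope, intercept = con.execute("""
    SELECT (AVG(x*y) - AVG(x)*AVG(y)) / (AVG(x*x) - AVG(x)*AVG(x)),
           AVG(y) - (AVG(x*y) - AVG(x)*AVG(y))
                    / (AVG(x*x) - AVG(x)*AVG(x)) * AVG(x)
    FROM pts
""").fetchone()
print(slope, intercept)  # 2.0 1.0
```

The same trick generalizes to anything expressible as sums over rows, which is why databases can push these computations down to the data instead of exporting CSVs.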


Looks like a promo


Just a slight conflict of interest: a company that makes SQL-based data modelling tools telling us that SQL-based data modelling is on the rise.

But given how many companies have set up data lakes with unstructured and semi-structured data (think SaaS exports), and how SQL layers over them have largely been unimpressive, I'm not sure it's the case.


And a lot of those lakes remain untouched for years..


It's not too far from datalake to dataswamp.


This article was written by yet another kid a few years out of college who doesn't know the history or the tooling and thinks he and his teammates are providing a solution to an existing problem. What he doesn't realize is that SQL reusability and data modeling were solved 20 years ago.


Have you seen their tool? Without having used it, it looks pretty slick. I think you're being a little too dismissive, considering "It's All Been Done Before(TM)" and that they seem to be building a rather nice UI for BI... a business with multiple players all making a pretty penny.


I'm surprised to see only a cursory mention of Snowflake in this article. In my experience, they are really the pioneers of the new distributed, cloud-first database. They really enabled large scale relational data warehouses, and are still miles ahead of even the big cloud players.


What is it you like about them? IIRC, their base plan starts at $1500/mo, so it's a high end offering that I haven't played with too much.


Already started a DataOps team in my company!


That's because SQL is pretty awesome.



The concept-oriented model tries to overcome some problems of RM by relying on two constructs: sets and functions. In contrast, RM uses only sets. The idea is that data can be stored in functions and transformed via operations with functions.

(0) https://www.researchgate.net/publication/337336089_Concept-o...


In `Everything is RDBMS`, data is dominant. It combines the advantages of RDBMS and NoSQL, and it emphasizes the following point:

```

It’s better to have 100 functions operate on one data structure than 10 functions on 10 data structures.

        ---- Alan Perlis
```


Except when you have large, telemetry style datasets e.g. web/product analytics which won't fit. Or when you are trying to build a wide table and you run out of columns. Or when your favourite SaaS products gives you highly nested JSON data.

RDBMS works great up until the point that it doesn't.


> Except when you have large, telemetry style datasets e.g. web/product analytics which won't fit.

Web analytics was one of the first applications for Greenplum. My understanding is that Yahoo collected tens of billions of events per day in the mid-2000s.

> Or when you are trying to build a wide table and you run out of columns.

HAWQ can run SQL queries over Hadoop clusters. Clickhouse's table width is limited by how much RAM you give it.

> Or when your favourite SaaS products gives you highly nested JSON data.

This is why major databases have JSON querying capabilities and why it's been added to the next SQL standard. PostgreSQL even allows you to define indices on fields inside your JSON structures.

Better yet: decompose the highly nested data. Relational databases begin to shine when you get past at least first normal form.

> RDBMS works great up until the point that it doesn't.

RDBMSes do work great until they don't. Which means they are almost always the best solution and almost always remain so.

Folks regularly overestimate the size of their problem and underestimate the capabilities of the literally dozens of RDBMSes now available for use. Yes, it irks me.

Disclosure: I work for VMware, which sponsors Greenplum development.


Traditional row-oriented RDBMS doesn't work there, but columnar engines such as BigQuery, Snowflake, and Presto make it work. I agree that web/product analytics data is hard to get right, which is why people rely on specialized solutions such as Mixpanel and Amplitude, but I believe that as companies become more privacy-conscious and want to bridge the gap between product data and other company data (marketing, finance, etc.), this will change over time. See how it's possible to run funnel, retention, and segmentation queries on Firebase Analytics data in BigQuery with SQL: https://www.youtube.com/watch?v=7tXVG785_0M


Even SQLite has JSON support these days...
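For instance, querying a nested field via the JSON1 functions, sketched through Python's sqlite3 (assuming JSON1 is compiled in, as it is in most modern builds; data invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (payload TEXT)")
con.execute(
    """INSERT INTO events VALUES ('{"user": {"id": 7}, "type": "click"}')"""
)

# json_extract pulls fields out of nested documents with a JSONPath-style path.
uid = con.execute(
    "SELECT json_extract(payload, '$.user.id') FROM events"
).fetchone()[0]
print(uid)  # 7
```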



