Eventually we even moved a very complex scoring algorithm from an overnight Hadoop Java process into Postgres as a compiled extension in C. We used to spend all night calculating those scoring results and caching them in an enormous table. But with the Postgres+C solution, we could calculate and store them in real time by having Postgres do the hard part.
Thus re-inventing Lotus Notes.
But I refuse. REFUSE. to re-invent Notes.
I would like to veto this with deadly weapons - up to and including nuclear.
This article touches on one of my favorite things, too, which is built-in date and time arithmetic. That's something that basically just works and comes for free in the database but is often a hodgepodge of messy third-party libraries elsewhere.
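A couple of lines of Postgres show what "just works" means here (table and column names are only illustrative):

```sql
-- interval arithmetic and time zone handling, no third-party date library
SELECT created_at + interval '3 days'  AS due_soon,
       age(now(), created_at)          AS invoice_age,
       created_at AT TIME ZONE 'UTC'   AS created_utc
FROM invoice_items;
```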
The other enormous one, for me at least, is arbitrary precision math. Proper precision, rounding, division, and exponentiation isn't easy to come by. I'm so tired of dealing with libmpdec, BigDecimal (flimsy as it is), GMP, etc. It all works well in the database and it's pretty fast. Perhaps it's not suitable for 1000-digit scientific computing, but it sure does make building financial software easy.
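For instance, Postgres's `numeric` type does exact decimal arithmetic out of the box (the values here are made up):

```sql
-- numeric is arbitrary-precision decimal: no float drift
SELECT 0.1::numeric + 0.2::numeric  AS exact_sum,   -- 0.3, exactly
       round(10::numeric / 3, 4)    AS division,    -- 3.3333
       power(1.05::numeric, 30)     AS compounding; -- stays numeric
```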
In Common Lisp, postmodern (http://marijnhaverbeke.nl/postmodern/) serves a similar role, with both DAO and lower-level access methods that compose pretty well. The 'perec' ORM from the DWIM project was based on Postmodern, and one of the weirder things I did quite easily was embedding custom SQL into the query builder, so that the object-based accesses would use the advanced SQL. Really nifty feature :)
I agree with this. It's nice to deserialise a DB row into objects, but actually querying with an ORM is usually terrible, and people should not be afraid to write smart queries. Especially queries like those in the sibling comment https://news.ycombinator.com/item?id=9112036 about date/time arithmetic.
What do you think of stuff like Hibernate Query Language? I'm currently working on a project that might use DQL (a PHP variation) to get our snarl of tables under control.
The whole point of an ORM is that it is database agnostic. Given how different databases twist the SQL standard and provide a whole raft of proprietary extensions, you need an abstracted query layer.
Whether you give it a name like HQL or not, there will always be some layer.
I agree with Conery, choose the components that will compose your application wisely and then embrace those choices and use them to their fullest.
EDIT: I inherited this system. I'd never use an ORM by choice.
For more complicated queries, I often hand write them in SQL first and then translate them into SQLAlchemy's methods. It's still worth it, since it's both easier to build dynamic queries (e.g. dynamic WHERE clauses) and it makes the parameterisation trivial.
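The dynamic-WHERE part can be sketched even without SQLAlchemy. Here's a minimal stdlib-only illustration of the same idea using sqlite3 (the table and filters are hypothetical); SQLAlchemy wraps this pattern up far more cleanly:

```python
import sqlite3

def find_invoices(conn, sku=None, month=None):
    """Build a WHERE clause from whichever filters are present,
    keeping every value parameterised."""
    clauses, params = [], []
    if sku is not None:
        clauses.append("sku = ?")
        params.append(sku)
    if month is not None:
        clauses.append("strftime('%m', created_at) = ?")
        params.append(f"{month:02d}")
    sql = "SELECT sku, price FROM invoice_items"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice_items (sku TEXT, price REAL, created_at TEXT)")
conn.execute("INSERT INTO invoice_items VALUES ('A1', 9.99, '2015-09-14')")
conn.execute("INSERT INTO invoice_items VALUES ('B2', 5.00, '2015-07-01')")
print(find_invoices(conn, month=9))  # only the September row
```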
My problem with stored procedures is that you can end up with a lot of business code in the db.
Of course, even for practical reasons, it's important to keep the DB lean. But some code really does belong together with the data.
(a) Oracle, Microsoft, Teradata, etc. all charge through the roof for scaling out your database, which of course you're going to need to do if you're adding more and more business code. Scaling out normal code? Cheap.
(b) What happens if you exceed the capabilities of your database (yes it happens). Then you have a major project on your hands migrating both data and code.
(c) The platform for developing/executing your business code in the database is like going back to the 1980s compared to the flexible, microservices world we live in today. Really, stored procedures over NodeJS, Scala, Go, etc.? Performance isn't that important in businesses where most processes are batch oriented.
(d) Databases in businesses are as tightly controlled as they get. As a developer, would you want to go through laborious change management processes every time you push a commit? It would cripple software development teams.
There is a reason why "data lakes" are all the rage right now. Because people need cheap places to store and process their data and then use their SQL database simply for querying and reporting.
(b) Been there, done that. Migrating code is EASY. Data is what's hard.
(c) Is that a joke? Honestly, I can't tell.
(d) That's more a reason to integrate those teams than to choose one place over the other. The programmers can disrupt the database in several ways without putting code there, and don't get any extra power by running their code on a different server.
By the way, the data that was famously leaking recently was mostly from email and file servers...
(b) Can you give an example?
(c) I don't understand what you're criticizing here.
(d) This depends on the organization. But generally business logic that would be at home in the database is business logic that should be controlled as tightly as the data is.
When I write a schema, I plan for testing it with dummy data. If it's PostgreSQL I write a test using PGTap. It's code. You test it. I don't get why everyone doesn't get that.
The validation would still happen on the node app side of things, but when it generates the create table query (as it must) it could easily also put in the equivalent check constraint.
Waiting on an official answer.
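The generated DDL could look something like this (a hypothetical table; the point is that the same rule the app validates is also enforced by the database):

```sql
CREATE TABLE users (
    id    serial PRIMARY KEY,
    email text NOT NULL,
    age   integer,
    -- mirrors the app-side validation; bad rows are rejected even
    -- when some other client writes directly to the table
    CONSTRAINT users_age_check CHECK (age >= 0 AND age < 150)
);
```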
Another, less useful property, is a considerable reduction of boilerplate when persisting a complex object graph.
Create the optimized version in a SQL editor, then use the builder pattern to fill in the holes in the query.
Much less boilerplate than any ORM tool generates.
> Another, less useful property, is a considerable reduction of boilerplate when persisting a complex object graph.
That depends on how expressive the application language's meta-programming capabilities are.
The thing is, if I have to fine-tune the ORM generated SQL for optimal performance, its value diminishes.
I disagree. The SQLAlchemy query language is very nice, for instance.
> That depends on how expressive the application language's meta-programming capabilities are.
Certainly, but metaprogramming doesn't necessarily go hand in hand with readability.
> The thing is, if I have to fine-tune the ORM generated SQL for optimal performance, its value diminishes.
Of course. However: the SQL generated will be fine for the majority of queries. What's going to bite is the use of lazy collections. For complex queries, you may want to fine-tune it by hand (though not necessarily).
That said, I rather agree that ORMs are best avoided. The existence of lazy collections makes it very easy to turn what ought to be a straightforward, fast query into something slow. What I'd like to work with is a query builder for select/insert, and avoid magic, even at the cost of creating awkward structures like PersonWithAddress and PersonWithAddressAndBankAccount by hand.
Another side benefit is it lets the library authors worry about portability between databases so it's much less of an issue.
That said, there is a good argument to be made that, if you've factored everything correctly (ha) and done your design work up front properly (ha ha) you've rendered the database as just another sort of persistence engine, so even an ORM is not always necessary.
"Why don't you use a foreign key to that table?" "We don't have that table..." "Yes we do; just because you haven't defined it in your project doesn't mean it doesn't exist!"
And... every project ends up with its own version of the "versioned schema", so to rebuild the DB, you need to aggregate all the Liquibase/Flyway files from all the projects and run them all.
When I wear my sysadmin/pseudo-DBA hat, ORMs are one of my more prominent pain points.
That being said, I realize it's often much more convenient at the beginning to have one database for integration. SQL is much more expressive than your service's interface will be, and in the beginning this seems like a killer feature.
- I'm so glad to see someone talking about `generate_series` for time-series reports. I often see queries that are `GROUP BY date`, which is usually okay, but will skip an output row if you have no data for a certain day. If you want to force one-row-per-day, use `generate_series`.
- His `GROUP BY` query can be rewritten with less repetition like this:
    SELECT sku, sum(price),
           date_part('month', created_at) AS m
    FROM invoice_items
    GROUP BY sku, m
    HAVING date_part('month', created_at) = 9

(Postgres accepts the output alias in `GROUP BY`, but `HAVING` still needs the full expression, or you can push the filter into `WHERE`.)
- Also on this query: In Postgres if `sku` is a primary key, then you can `SELECT` other columns from `invoice_items` without including them in `GROUP BY`, e.g. `SELECT sku, name, supplier_id, etc`. I think even joined stuff is okay: `SELECT sku, suppliers.name`.
- One more nit on this query: I think this is backwards from what he meant?: "having to be sure that whatever you GROUP BY is in your SELECT clause."
- Usually I don't use window functions where a `GROUP BY` will do, but to me the biggest value in window functions is they don't force any particular structure on the overall query, so they don't constrain you the way `GROUP BY` does. He kind of alludes to this when he shows per-sku monthly sales combined with total monthly sales. Good example!
- I'm not sure his CTE example shows much added value from the CTE, but maybe I'm missing something. The three places I really like CTEs are (1) recursion, (2) with INSERT/UPDATE/DELETE and RETURNING, (3) very big queries. But a lot there depends on taste.
- When you use `string_agg` and friends, you can give an ordering for just that function: `string_agg(downloads.title, ',' ORDER BY products.sku)`. That syntax works for every aggregate function (I believe), although it's especially useful for the *_agg family.
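The `generate_series` point, concretely (dates and columns assumed from the article's running example):

```sql
-- emit a row for every day, even days with no sales
SELECT d::date AS day, coalesce(sum(i.price), 0) AS daily_total
FROM generate_series('2015-09-01'::date, '2015-09-30'::date,
                     interval '1 day') AS d
LEFT JOIN invoice_items i ON i.created_at::date = d::date
GROUP BY d
ORDER BY d;
```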
Thanks for sharing these tips! I wish more people knew the cool things you can do beyond SELECT x FROM y.
For example, how do I apply DRY while using SQL? How do I apply type safety principles? How do I avoid syntax errors?
It feels like this ends up with building some sort of ORM.
Anyway, I'd love to hear experiences with interfacing SQL with "good" safety principles in larger-scale projects.
Lots of ways. Defining appropriate views is one of the key ones, but not the only one.
> How do I apply type safety principles?
That's too vague of a question to present a concrete answer to.
> It feels like this ends up with building some sort of ORM.
Anytime you are using a database from an external language, you are going to have some code to translate back and forth between database objects and operations and application-language objects and operations. You might call it an ORM, but if you are using SQL more robustly, it may not look much like most things that go under the ORM label, and some popular ORMs are fairly opinionated in ways which fight effective use of SQL (ActiveRecord in particular at least used to be; I haven't done much with it in quite a while and am not sure if that's still the case).
This API lends itself to TDD, and yes, you must test your database components the same as any other code. Preferably using database code. For PostgreSQL I really liked PGTap. Include this testing in CI and other unit test executions.
My sniff-test for a well-done schema is that developing application(s) around it should be clean.
The downside is that you need to express the command language, so you lose the ad-hoc querying capability that SQL gives you. But something more structured and more closely corresponding to the domain is generally what I want anyway.
If the needs have changed, the API should be upgraded. It seems like ad-hoc querying is mainly to get around shortcomings in the API, right?
...until your broken data starts biting you and your customers in places that don't typically see sunlight.
There are no places in any code base at all where it doesn't pay to be rigorous about DRY.
>It turns out, your model and validations don't actually prevent broken data.
A) DRY isn't about preventing broken data.
B) Validations are for preventing broken data. That's what they do.
Remote facade sounds like a lot of boilerplate to me.
I'll offer some of the reasoning that got me to building remote facades. The database needed to sanitize its own inputs; we could not trust application developers with that. There would be multiple client-languages (Ruby, Python, Java, for starters) and I wanted to keep domain adapters for those languages as simple as possible. The database would be changing regularly, and many changes could be simply absorbed in the remote facade, so applications depending on the remote facade via their client-language libraries would be somewhat insulated.
Now, it wasn't a perfect regime, and nothing is. But it paid the bills for years and my schemas are still in play at that place.
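A tiny sketch of what such a facade function might look like (all names invented): clients call the function, and the tables behind it can change without breaking them.

```sql
-- clients see only this signature, never the tables behind it
CREATE FUNCTION api_create_person(p_name text, p_email text)
RETURNS integer AS $$
    INSERT INTO people (name, email)
    VALUES (trim(p_name), lower(p_email))  -- the DB sanitizes its own inputs
    RETURNING id;
$$ LANGUAGE sql;
```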
For a large org, Slick also adds to the already significant Scala learning curve for Java devs. Scala itself is good and pretty, but when the first task you give someone new is to write a CRUD-like REST service using Slick on one end and Spray on the other, they don't get to spend any time learning actual Scala, and instead learn domain-specific languages that are hard to reason about without serious Scala experience. No, this is not speculation: we are going through this problem at work right now, with dozens of developers learning Scala, and only a few of us who really started with significant experience.
If I have to use Slick, I use the few direct SQL bits. From my perspective, that is the one saving grace of the library.
What library would you use instead of slick?
ORMs are about code reusability.
First off, validation of input data is painful without ORMs. Converting data between SQL, your language's type system and the user interface in a reusable way almost necessarily leads to an abstraction which looks like an ORM. Not doing so quickly leads to bugs and vulnerabilities. I can usually tell at a glance if Django ORM code leads to an injection vulnerability or not (hint: unless there is custom/raw SQL, there is no vulnerability). That's harder to do in pure SQL.
Secondly, queries can be modified and introspected in code. Hard to do with SQL query strings, easy to do with ORMs. With the Django ORM you can progressively filter a query using variables from your context, you can modify a base query to get you a count, an aggregate and a list of model instances, and you can introspect a query for the tables it will touch.
Thirdly, an ORM makes queries reusable by different sub systems of your framework. Again in Django, you can pass a Queryset (without evaluating it before it's necessary) to the template system, to the input validation (django forms), to form widgets, the introspected admin, REST frameworks and so on.
Fourthly, ORMs help manage migrations. Migrating to and from different points of migrations can be hard to do in raw SQL or with your own tools. ORMs aren't perfect at this, but SQLAlchemy and the Django ORM both have very good migration tools which save you a lot of time and a lot of frustration, with minimal headache now and then. Migrations are easily testable and easily repeatable.
I also believe that ORMs are more important to the framework than for the individual application. A framework can only survive and grow a strong community if it can satisfy a large base of users. For Django's success it was necessary to support SQLite, Postgres, Mysql and Oracle, whereas single Django projects might thrive without ever changing to and from Postgres.
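For example, the progressive-filtering point looks roughly like this in Django (the `Invoice` model and its fields are hypothetical, and this is a sketch, not runnable outside a Django project):

```python
# nothing hits the database until the queryset is evaluated
qs = Invoice.objects.all()
if sku:
    qs = qs.filter(sku=sku)                  # narrowed, still unevaluated
if month:
    qs = qs.filter(created_at__month=month)  # composable with the above

total = qs.count()   # issues one COUNT(*) query
rows = list(qs)      # the actual SELECT happens here
```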
However, there are plenty of things I prefer in mssql, but I won't go into them all here.
I didn't want to take away from the focus or be a jerk about it.
Things I prefer in mssql environment:
> Triggers work on entire dataset instead of row by row.
> Stored procs can return multiple data sets and output variables all from a single call.
> Opinion: t-sql is friendlier than plpgsql to work in.
> Backups/restores are more straightforward.
In pgsql you can have triggers that fire per row or per statement.
You can return results through arguments by declaring them with the "OUT" argmode.
While it is hard to argue with an opinion, you can also write your functions in Tcl, Perl, Python (these three come standard), Java, PHP, R, Ruby, Scheme, or sh.
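A minimal example of the OUT-argmode point (the function and columns are invented):

```sql
CREATE FUNCTION sales_stats(OUT total_rows bigint, OUT total_price numeric)
AS $$
    SELECT count(*), sum(price) FROM invoice_items;
$$ LANGUAGE sql;

SELECT * FROM sales_stats();  -- both values come back from a single call
```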
This is one of the big features touted by ORMs, but I wonder how often it actually comes into play. I've been using Rails with ActiveRecord for most of the last 10 years now, and I've not once changed databases after starting a project. There's one project where I'd like to, but despite the use of ActiveRecord, there's enough MySQL-specific stuff in the code that it's not going to be trivial to just copy the data over to a Postgres database and flip the switch.
Databases are not interchangeable, though sharing a common query language makes the skill set somewhat interchangeable. I think this is part of the nuance often lost in the discussion of ORMs and database portability.
I have worked with many developers who actively deny this basic reality, in favor of trying to bury all that variability under an ORM, rather than exploit any of the proprietary features. I understood it as a crutch when I was still in the flat part of that learning curve, but I am a lot happier building a rich SQL application interface (non-ORM!) these days.
I see two realistic reasons for why having an ORM/avoiding DB-specific queries is good idea:
- your product supports multiple databases
- you want to use something like sqlite in memory for tests
In this case, I tend to see pretty light loads. You can do this with fairly vanilla (ORM-generated) SQL and design. When your business begins to lean heavily on the database for operations (I saw this in finance) around the clock, and you have to start being more judicious about your queries, these applications are usually the first to buckle.
> you want to use something like sqlite in memory for tests
Only if I have to support sqlite in the field!
I've wondered the same thing in the past, but just recently I converted two Django projects at work from MySQL to PostgreSQL. The transition was pretty much seamless--I didn't have to change any application code.
One of the projects was converted to add spatial capabilities via PostGIS. The other was converted due to an issue with how MySQL stores data (it ended up being easier in a time crunch to dump and reload into PostgreSQL than fix the issue with MySQL).
But I also just realized, there is a whole ecosystem of ORM based apps that can run on any of the supported databases. If you ever used an MVC framework and re-used apps, you have "switched databases" without even knowing, depending on what the original developers used.
You are supposed to bypass every ORM sometimes. One measure of a good ORM is whether it makes that easy for you or difficult.
Regex queries are definitely one example of a niche SQL feature for which I would bypass an ORM altogether.
95% of my SQL queries are not that, though, and for the very standard ones I use an ORM to heavily restrict the amount of code I have to write and maintain type safety.
You can export data into an ORM-specific, database-agnostic format, such as Django or SQLAlchemy fixtures, and test against the ORM. This is actually one of the key ORM features to leverage.
That said, there are probably fewer SQL injections in ORM code than typical mysql messes.
I'm looking for a single example of ORM-based project switching RDBMS. It looks like it almost never happens.
For example, this is equivalent to the original query:
    SELECT sku, sum(price)
    FROM invoice_items
    WHERE date_part('month', created_at) = 9
    GROUP BY sku
Of course, that only works because this example needs only a single month group. If you want multiple months and per-month price totals, then my rewritten query is no longer equivalent. It would need the more complicated GROUP BY. However, the window function version would also get more complicated in just the same way. The complication is inherent to the problem, not due to standards conformance.
Actually, what's really tedious is the verbose expression required to get the month and the fact that the author is repeating it. The standard provides "WITH" to avoid that duplication, and Postgres implements it.
Here's a reasonable version for multiple months:
    WITH items AS (
      SELECT *, date_part('month', created_at) AS month
      FROM invoice_items
    )
    SELECT sku, month, sum(price)
    FROM items
    WHERE month IN (7, 8, 9)
    GROUP BY sku, month
And a window-function version that computes several kinds of totals at once:

    WITH items AS (
      SELECT *
        ,date_part('year', created_at) AS year
        ,date_part('month', created_at) AS month
        ,date_part('quarter', created_at) AS quarter
      FROM invoice_items
    )
    SELECT DISTINCT year, month, quarter
      ,sum(price) OVER (PARTITION BY year) AS yearly_total
      ,sum(price) OVER (PARTITION BY year, month) AS monthly_total
      ,sum(price) OVER (PARTITION BY year, quarter) AS quarterly_total
      ,sum(price) OVER (PARTITION BY month) AS month_number_total
      ,sum(price) OVER (PARTITION BY quarter) AS quarter_number_total
    FROM items
    ORDER BY 1, 2, 3, 4
It's admirable and fun to think about, for sure, but how much use does that feature actually get?
(Lukas from jOOQ here)
Surprisingly, that feature gets quite a bit of use. Here are a few of the use-cases (among many others) for which our customers choose jOOQ:
- They're using a test database (e.g. H2) and a production database (e.g. Oracle), and things still work very nicely, while staying close to SQL
- They're using a client database (e.g. HSQLDB) and a server database (e.g. SQL Server), and can easily replicate without needing to remember subtle syntax differences all the time.
- They're selling complex products that run on databases as varied as DB2, MySQL, Oracle, SQL Server and Sybase, and they want to use SQL because their queries have an average of 10 joins each.
ORMs just add a thick layer of fluff over an otherwise (mostly) portable language, as far as I'm concerned. I'd rather keep it simple between DBs and apps. If there is gross SQL in the app, make a view or function in the database. It isn't rocket surgery, but you do need experienced staff to get it right.
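In other words, something like this (a made-up report view): the gross SQL lives in one reviewable place, and the application side collapses to a trivial SELECT.

```sql
CREATE VIEW monthly_sales AS
SELECT sku,
       date_part('month', created_at) AS month,
       sum(price)                     AS total
FROM invoice_items
GROUP BY sku, month;

-- what the application runs:
SELECT * FROM monthly_sales WHERE month = 9;
```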