Hacker News new | past | comments | ask | show | jobs | submit login
Let Postgres Do the Work (2013) (sorentwo.com)
420 points by Jarred on Nov 26, 2016 | hide | past | web | favorite | 202 comments

At work, we have an internally generated report that was done purely in ruby. It worked, barely, munging numbers for about 1000 records. When we hit 4000 records, the report took about 6 hours to run in ruby.

We looked for some obvious optimizations like eager loading and some map reduce magic and implemented them. That brought us to a range of 45 minutes to 2 hours.

We decided that it was still not good enough. We moved ALL of the report into SQL functions (Postgres) and we have the whole function running in 27 seconds.

I agree with some of the top comments that maintaining this is a little trickier than ruby code, but some ideas we came up with in maintenance: 1) Write the SQL in a ruby file that has a class method of sorts that returns the generated function. This allows for version control to track the changes of the sql function. 2) Focus on tests at the interface level. We had great test coverage on the end result prior to moving to SQL. We were able to re utilize those tests to verify accuracy with calculations that are now in SQL.

This really isn't for everyone, but for those that want to try something new, it is worth it.

I once reduced the running time of a report from 45 minutes to 3 seconds (900x improvement) by moving the code inside the database.

If a programming language wants to stay fast it must eventually become a database. I realize this is an unpopular opinion but popularity is the wrong metric to judge by.

576 hours -> 12 minutes (2880)

Admittedly the code I was moving away from was hilariously bad. Threaded custom build of PHP bad. Then again I haven't tried to optimise the code I wrote at all

You are absolutely right about fast code becoming a database, this is simply down to the query planner - it can try and do the least possible work for the data you actually have.

I recently used temporary per-transaction tables (CREATE TEMPORARY TABLE .. ON COMMIT DROP, basically CTEs that persist across statements, and that can be indexed) with json_to_recordset and turned a three-minute ruby ETL background process into a sub-1-second inline call.

CREATE TEMP TABLE is really awesome. Not really related, but I used it at my previous gig to optimize the unit tests. They would previously all use the same database that devs use during development, so scans over some large tables were particularly inefficient, and tests could break when someone modified table contents for some development task.

I implemented a small function where you could list the tables that you were working with, and it would create empty temporary tables with the same name in a namespace that takes precedence over the normal namespace in postgres' search_path, therefore giving you a blank slate to work with for the unit test, while the other tables were still populated with useful data. (Some of these data come from expensive infrastructure scanning jobs, so starting from an entirely empty database was not an option.)

What language were you using to do that?

Just curious - not sure if you were using Ruby like the parent above. I literally just built a library to back ActiveRecord models with temporary tables. It's very useful since you can use AR scopes off of your temporary table then (I had to do this instead of using CTE's because my DB currently doesn't support CTE's)


Just thought I'd share it in case it helps, or to hear if anyone else has a better approach to this. (warning: it's very experimental and not production tested yet). I couldn't find any other gem that would do this.

Cool idea.

How did you resist the temptation to call it temptation?

Although I think temptable is almost the opposite of contemptible, therefore also good going in the pun department.

My code is all Ruby, and I ended up pushing all the work into SQL and just eventually selecting with find_by_sql into existing models for the results. There is possibly concurrent invocation vs updates and a race against itself, so it's also all wrapped with a serializable transaction and starts with a mutex lock on the controlling model.

Hah! I was sitting around trying to think of a clever name, but then I got tired of sitting around and just went with temptable. Had I thought of temptation I would have gone with that.

The approach I went with works really well for cases when you want to persist the temporary table through the end of the request (it works good for master/index type views that may have a lot of sums/counts and filter options available on it).

One shower idea that I had is the concept of a clojure version that instead of compiling to JVM bytecode, javascript or CLR, compiled to postgreSQL. I think that would be awesome: you could just run the same functions, seamlessly, from the database, through the web server, to the browser. And, whilst of course you need to know about the limitations of the database, it could be great for pushing code to the database seamlessly.

Unfortunately I don't think I have the skills for that, so posting here in the hopes that someone that can likes the idea :)

This is kind of what LINQ does. You write a query in C#. The structure of the query (like an abstract syntax tree) can be exposed to the query provider which can interpret it or compile it to the CLR or compile it to SQL or whatever.

A sufficiently advanced LINQ (http://wiki.c2.com/?SufficientlySmartCompiler) would do wonders in some cases, but I haven't encountered it. AFAIK, LINQ to SQL only knows about joins and some aggregate functions.

One thing that makes improving it cumbersome is that the semantics of many operations are slightly different in the database than in C#. For example, SQL Server doesn't do NaNs and infinities, supports a zillion string string collations, and its date types may not 100% map to C# ones.

Also, databases may run stored procedures faster than the SQL that LINQ genrerates on the fly because they can cache their query plans (most databases will detect that an on-the-fly query is equal to one run recently, though, so this may be somewhat of a moot argument)

Which can lead to great, or less ideal results.. Seeing some of the resulting queries via monitoring are wild though.

At work we've got a LINQ query that gets passed around a few functions, eventually getting into a core of 60 lines of LINQ logic. Colleague verified that chaining selects produces different output, but gets ran at the same speed (Chain selects in order to somewhat declare variables, Select(x => new { x, y = <query-x> }) then you can Select(xy => new { use xy.y multplie times }))

Sometimes I think I should just be using sql.. (which we do on other projects)

Yeah, I've been pretty happy without ORM in node.js, I even wrote a semi-nice wrapper so I could turn template strings into parameterized queries. Made writing a bunch of migration scripts a cakewalk.

Sometimes it's really just easier to write SQL directly.

Do you mean something like Datomic? It has been intriguing me for a while...

Hence Kx, ZODB.

I'm not too familiar with ZODB, but it looks like it tries to impose OO on a database, while in reality relational model works best with data, so making programing language able to interact with data that way would be better. I think something like JOOQ[1] is close to that.

In order to get good performance you want to minimize number of back and forth requests over network. So instead of making a request to obtain list of items, and then fetching each of the item one by one (so called N+1 select issue) you will get a better performance if you make the database send only the data you want, nothing more, nothing less.

[1] http://www.jooq.org/

> I'm not too familiar with ZODB, but it looks like it tries to impose OO on a database, while in reality relational model works best with data, so making programing language able to interact with data that way would be better.

ZODB itself is essentially a transactional single-object store, where that "single object" is usually the root node of an arbitrary (possibly cyclic) object graph. Storage depends on the backend, nowadays you'd usually use file storage for development and relstorage for any kind of deployment.

It doesn't have any query language, Python is. There are no indices beyond those that you explicitly create (eg. a dictionary would be a simple index, but other packages provide more complex indices, like B-Trees - the nice thing here is that anything that goes into the DB is fully transacted, which removes a lot of headache that you'd have with other solutions (eg. flat XML files)).

ZODB thrives when your data model is too complex to easily or efficiently map into SQL, when you don't do or need efficient complex ad-hoc queries and when 180 % performance and concurrency (although this was recently improved through the MVCC reimplementation) isn't the highest priority. Since it's pretty good at caching (much better than most ORMs) performance usually doesn't suck.

ZODB is not a database, nor is it fast.

More like a database toolkit. Performance depends on application but was clearly never the top priority. (Can still be pretty impressive in comparison due to the rather effective instance caching)

Yep, the K interpreter is pretty mediocre (it is very small and the language is inherently pretty fast, but the interpreter is otherwise not very advanced¹) but being directly integrated with a very optimized in-memory column-store database means it smokes anything else. Even the massive engineering effort in a JVM+RDBMS simply can't compete with running K directly on memory-mapped files.

1. I believe they added SIMD support a year or two ago and got a massive (2-10× IIRC) speedup.

> This allows for version control to track the changes of the sql function.

You... realize you could just put the SQL itself in git, right? Not saying there aren't potential benefits, but "it can go in version control" isn't one of them.

Absolutely! We use rails. I just mentioned doing it in that method as opposed to throwing it in a migration. That migration shouldn't change in version control. We wanted a history of the changes and an easy way to access it in migrations.

Sorry for not explaining that fully.

Makes sense.

I suppose you could also execute "REPLACE FUNCTION" and redefine the inside a new migration each time you want to change it. The downside is that you end up with the whole function in new a migration every time you want to change if. But if you're not updating your stored procedures very frequently (which is usually the case, IME), it probably wouldn't be too bad.

At work we use a rather idiosyncratic tool called RoundhousE to deploy our SQL. It treats sql labelled as a stored procedure differently, and will run the file on every deploy, as opposed to scripts that alter the schema, which are only ran once, and may not be modified.

The system works pretty well.

You don't need a tool for this. If you store base data in one schema, and functions/views in another, you can update this second schema willy nilly without worrying about damaging your data.

Just start the second file with DROP SCHEMA IF EXISTS schemaname CASCADE; CREATE SCHEMA schemaname;. Then, make sure any functions/views/permissions built on top of this schema are also contained within this file.

Or you can just pull the SQL for the stored procedure from a table column and execute it dynamically.

Now it's an ops problem :-)

Haha, usually we hear this from smart guys who have been working for about 8 months...

Oh, that's slick. Nice!

Or you can just define the function every time you use it. https://www.postgresql.org/docs/current/static/sql-do.html

That won't help if you want to use it for indexing, as the article demonstrated.

You are right but a Rails migration would provide a way to deploy to the database. Far from ideal though, because you could have many migrations with that SQL code, one for each version of the code. Not sure what's the best solution in this case.

I'm not a database guy, but I have to ask: you're talking about 4000 records, which is essentially nothing.

Why not just load all the data from SQL into a DataFrame in Python or R and do the processing there? I've written some pretty big R DataFrame crunching scripts, including very wide-window moving averages operating on tens of billions of records, that run in less than five minutes on my laptop.

Hell, I have a web app built with Flask+Pandas, running on a single t2.small instance, that does complex queries into a 6000 records big chemistry data set based on user input of selectors. It can serve thousands of simultaneous users without noticeable slowdown.

What am I missing here?

4000 records with a one-to-many join (the many part capped at 240 records). The aggregations are done mostly on that join table.

In our original ruby process, it would return just that, 240 records per join and aggregate on that. Our SQL process aggregates that in a stored procedure and returns the flat 4000 records.

We really didn't want to introduce another piece of technology to this stack, hence we didn't look at R or Python or Elixir. We did consider storing all of this inside elasticsearch for a hot second but keeping things in sync at the correct times seemed sub optimal and would add a layer of complexity we were all uncomfortable with.

ONE join? A join is usually just a linear scan through one table, doing hashtable lookups on the other table. If there is no hash table, at the worst both tables have to be sorted on the join key and scanned linearly.

The absolute worst case of a "join" is a cartesian product which runs in quadratic time. 6000^2 = 36000000, which should still easily finish in ~1 second.

This really sounds like you're having an issue linked to the ORM.

I don't know the Rails equivalent, but Django's ORM has a way of prefetching related fields on a model in a single query.

My bet is that your code wasn't 1 query, but actually 4000 * 240 queries, which will always take a while.

EDIT: I just saw in a sibling comment that the actual data fetching was only taking 30s. I imagine you were doing that part mostly right

You can install R or Python into a PostgreSQL server as a language extension and then use it to define SQL functions. This is really useful to define functions which can reduce the size of the recordset transmitted across the network.

You wouldn't likely put the whole app to run on the database server, but typically PostgreSQL makes good use of the available RAM and disk, but leaves lots of CPU cores idle. So running some R and Python on the PostgreSQL server is an overall improvement in performance.

P.S. Yes, I know that PostgreSQL 9.6.1 will use multiple cores for aggregate and join queries on large tables, but few people have upgraded to that yet.

My immediate guess would be that they already knew SQL and had the data in there - and both R and pandas have a tricky initial learning curve if you're not familiar with dataframe style operations (hell, I was a pure math major some years ago and I have to contort my brain to do dataframes whereas I long since got comfortable with SQL).

It could be that each of the 4000 records needs another 10k records each to get the report you are looking for.

Oh absolutely. I've tried doing complicated logic with (MySQL) stored procedures. It's painful. Definitely send the queries from a sane programming language.

One way to think about this is that you just want to take control back from the ORM of the exact database queries sent down the pipe. Which, BTW, is why every ORM worth its salt has functions for dropping down to plain SQL - teach that as best practice for doing computation in SQL queries, not stored procedures.

Most ORMs will not do the level of aggregations you want or need without writing raw SQL within your ORM. If you are already writing raw SQL in your ORM for aggregation level computations, you might as well create a stored procedure, make it IMMUTABLE (if you can) and gain those performance benefits.

At the end of the day, I would just use the tool that has the most comfortability with the team on the project. You can obviously start with an ORM, but there may come a point in time where performance needs a little boost.

What are the real performance benefits of stored procedures over sending the hand-written queries over the wire through your ORM? I suspect small to none, and at the cost of having code that lives in the database rather than in version-controlled code.

code that lives in the database rather than in version-controlled code

If you think you can't version control stored procs, then you are doing something badly, badly wrong.

What's the difference between a "script" which is a text file on a filesystem at the end of the day, versus a "stored proc" which is a block of text in a database, loaded off a file in a filesystem?

The "script" gets deployed simply and reliably using your existing tooling, and stays in sync with the surrounding code.

Replace the stored procedure from version control at deploy time.

And then you have to make sure your deploys are atomic so that the app code and stored procedures don't get out of sync on failed deploys...

Or your stored procedures now become one more thing that has to be backwards compatible, in addition to your DB schema...

I've run into all of these problems with stored procedures, and they SUCK. Good deployment tooling can solve them, but it's a lot of effort for questionable reward.

a lot of effort for questionable reward

It's a small amount of effort for a thousand-fold increase in performance and not having to re-implement the same functionality in every app and in every language that connects to that DB. YMMV.

The argument was between using SQL in stored procedures or SQL in queries sent by the application, so you don't have to re-implement anything (just share the SQL queries between apps), and I very much doubt you'll get a thousand-fold increase in performance.

Thousand-fold increase in performance by shaving off network and query compilation time?

this absolutely mirrors my experience

One performance benefit of stored procedures in Postgres is that you can perform a multi-statement transaction in the stored procedure, saving the network overhead of multiple round-trips between the application and the database. This may not be a common occurrence for many applications, but it is a legitimate performance benefit when the need arises.

Note that the network protocol allows pipelining for multiple statements (including transaction control ones), too. Unfortunately only few client libraries (e.g. pgjdbc) make use of that :(

I don't know about Rails and ActiveRecord, but I can do this in Slick.

you can do multi-statement and nested transactions even in Rails

If you're already version controlling your schema, the CREATE OR REPLACE FUNCTION call lives in whatever's version controlling your schema.

If you're not already version controlling your schema, you should seriously consider starting whether you're using sprocs or not.

You can literally do version control in the database as well. Upgrade and downgrade your entire scheme at will. Unfortunately the only open source software I'm aware of that does this is Archiveopteryx



Honestly, it's yak shaving at that point. Like the article pointed out, you can create an index on a stored procedure that is IMMUTABLE. That will have some effect on the performance of the query.

There are other patterns you can use to track via version control the changes for your stored procedure.

> report

Always pure SQL, at least as you're using a DBMS. Never anything else.

SQL can stretch to full Turing completeness, so it'll always be good enough for that. And between JSON support and PLPGSQL most everything can be done in Postgres.

Unfortunately, "pure sql" usually means passing in strings, which leaves you without good tooling to manage the code. This can greatly slow down the speed of delivery. It's great if you are in an ossified section of the code/database, but it can be very brittle (or expensive to validate) otherwise.

Modern RDBMSes can handle most of the ORM use cases at their end. What's wrong with VIEWs, for instance?

Plus it solves the whole schema impedance mismatch thing to just leave it up to the DB.

ORMs aren't that great. Having spent the past five years primarily with an object database that is also fully SQL (ISC Caché) I find myself preferring SQL more and more. Even when you have the ORM within the bloody database I'm going SQL.

The string thing isn't an SQL problem, it's a tooling one.

The only tool per se should be pgAdmin or an equivalent. And in your application the only "SQL" that should be written is SELECT * FROM viewforthisfunctionality and select runupsertwithjsonobjectthatpgorplpgsqlcanreadjus tfine(json)

Have you considered a hybrid approach, like doing the heavy lifting/querying/calculating in Postgres, but the final formatting of the report (something that might change often, or be very finicky to implement in pl/sql or in the db) is left to ruby?

That's exactly what we did. We return an anonymous table of records in Postgres that are then passed to ruby to render an excel file. The SQL query takes 27s on average. The ruby view generation is fast as well but I haven't profiled that part of the report.

> That's exactly what we did. We return an anonymous table of records in Postgres that are then passed to ruby to render an excel file. The SQL query takes 27s on average. The ruby view generation is fast as well but I haven't profiled that part of the report.

What's an "anonymous table of records"? Do you mean it's just a result set (i.e. query)?

Yea. It returns a table of rows, where the return columns are defined in the stored procedure. I may have used the wrong nomenclature. I was trying to differentiate between a table in a database schema versus one defined in a function.

"The stored procedure returns a multi-row resultset via a SETOF[ROWTYPE] return value" might be a clearer description?

Thank you!!!!

I've had similar experiences. And it's not always Ruby processing being the slow part but returning multiple result sets to your application layer.

I've seen some huge result sets (thousands of rows) returned to do a simple aggregation for a dashboard widget.

Absolutely agreed. In our particular situation, the ruby calculations were the bottlenecks based on our benchmarks.

I took a 20-page multi-threaded ruby code that took 2-4 days to complete and changed it to a 1-page MySQL stored procedure with some @variables. Brought it down to 4 minutes to run. And to top it off, reduce the overall server load from 96% to 20%.

I kept that CPU usage chart and added it to my CV as it was quite impressive to look at.

Correct usage of ORMs if it's not in the database is another important and easy thing to get right. This often gives a 10-1000 times performance improvement just due to the avoided "ping pong" game by using eg. query annotations that tell the ORM which related objects you'll access.

6 hours to 15 minutes here. This is with the right indices and stock install. I am sure an expert can do much better.

27s is still an awfully time. Why does it take so long?

We didn't make it immutable and we didn't add an index to it. Those are our next action items to help improve the performance of it.

Honestly, instead of looking into more performance gains, I wanted the team to get a huge win and for the team to look good in front of senior management :D.

If you want to optimize for looking good in front of senior management, you would put a sleep() in there to make it run for 5 minutes. Then when senior mgmt comes back and complains that 5 minutes is still too much, you remove the sleep(), bill a few hours to time tracking, and praise the new performance improvements that you just deployed. ;)

Yes, this is exactly the article that I had in mind earlier. Thanks for digging it up.


Ha ha politics driven development. Love it.

Ehh. More about boosting confidence and team morale than politics. The goal was achieved, and we can yak shave during slack time if need be.

This is an example of what I call "embracing PostgreSQL". Roughly speaking it means ditching the ORM layers and using a simple API to run any SQL query that you can think up. Then your application can leverage the many features that exist in PostgreSQL and not just treat it as a dumb SQL server.

Using functions as an intermediary to raw queries, also decouples the SQL details from the application. What I mean is that if the application calls an SQL function to read, update, insert data then you can revise the SQL functions independently of application code. In other words, if you refactor an SQL function, you can deploy the change to the database without rebuilding and deploying the application.

This is especially helpful in companies that have an onerous verification and validation process for application changes, because often "performance improvements" are classified as administrative changes which can be done with less bureaucracy.

So, embrace PostgreSQL as a permanent part of your application and reap the benefits of its many powerful features.

Our application uses JSON throughout, and i've found the JSON functions in postgres to be invaluable!

I recently refactored several functions which kind of organically grew from several different functions each with it's own SQL query into one big query.

Not only did performance increase by a significant amount, the whole function is basically just "run this query and return the result" as the JSON functions in postgres allow me to leave the actual processing in the database and to not have to do the always annoying "loop over the 30,000 rows that were returned to build a JSON file" thing which is never fast.

JSON in Postgres is so fast... it's definitely not used widely enough.

I'm not even sure why people would use something like Django Rest Framework, when Postgres outputs JSON directly from a query.

Because they want a RESTful endpoint, not JSON-over-SQL. Would you open that up as a public API with free registration? How do you do access control using OAuth? How would you do stuff like convert images to thumbnails, extract text from PDFs or send an email?


Anything more complicated than that, and I'd recommend a simple Go application.

Both of those are adding a layer over Postgres; they are still "something like Django Rest Framework". Whether it's written in Python, Haskell or Go is not really the point I was making.

There's a lot of substance to the stack chosen. The performance Haskell or Go offer over Python let alone Django, a fat framework, make them far better picks for thin wrappers around an RDBMS.

Because it is easier to maintain Python code rather than SQL.

As someone who's thought about embracing over using the ORM, here's a couple issue that come up:

- Version control. The SQL functions end up just being data in your DB. This feels like an anti-feature in the world of continuous integration and code review.

- Lack of composability. An ORM gives you the ability to easily compose filters. This lets me write functions like "take this query, and make sure it's properly filtered down to one user's data". Because of how SQL works, each query needs to be hand-crafted, and it's hard to make parts of it generic.

- Rewriting business logic. I have business logic in Python already, but if I want to query off of that in the DB in SQL, now I need two implementations. You can sort of get around this in an ORM by annotations.

- I'm not sure what the developer tooling environment is like. PyCharm is pretttty nice.

To be honest, you citing that it's easy to get around the validation process when using Postgres is a major red flag. Why would I want to circumvent code review for things that are most undoubtably code?

The right way to have composability in a RDBMS is with things like parameterized views or table-valued functions. This would fall under embracing the RDBMS, leveraging its modern capabilities.

Version control. The SQL functions end up just being data in your DB. This feels like an anti-feature in the world of continuous integration and code review.

Functions can be treated the same way as migrations. With PostgreSQL you have transactional DDL, so you can atomically deploy and rollback functions. There plenty of different script packages to help manage this. At the end of the day it’s deploy and rollback SQL scripts in text files, which are easily managed with version control.

Lack of composability. An ORM gives you the ability to easily compose filters. This lets me write functions like "take this query, and make sure it's properly filtered down to one user's data". Because of how SQL works, each query needs to be hand-crafted, and it's hard to make parts of it generic.

The trade off here is that you’re trusting the ORM to produce an optimal query (or queries) for whatever you’re trying to do compared with writing the SQL yourself, with knowledge of your schema. Like any trade off, its one you need to decide for yourself.

You may have filtering on the data that you’re returning from the database that you chose not to do in the query itself. At this point, you should just be dealing with data structures. I don’t think this would be different whether you’re using an ORM or not.

Rewriting business logic. I have business logic in Python already, but if I want to query off of that in the DB in SQL, now I need two implementations. You can sort of get around this in an ORM by annotations.

I’m not sure what you’re getting at here, so I’m not sure if this addresses your concerns: If you’ve got clean separation of your code, you should only need to update a single location for any given piece of logic. Single source of truth, and all that.

I'm not sure what the developer tooling environment is like. PyCharm is pretttty nice.

SQL scripts are generally text files, so anything that supports some kind of SQL syntax highlighting is likely all you need. I suspect PyCharm will work. The Jetbrains folks are pretty sharp.

Why would I want to circumvent code review for things that are most undoubtably code?

Since you can have your scripts under version control, you can likey use the same code review tools that you currently do. You can also use libraries such as pgtap[0] to test your schema and functions indepedent of the rest of your application, if you so choose.

These are all issues I've dealt with in a large production PostgreSQL environment. They're good things to think about, and you're right to consider them. At the end, you need to make a decision you and your team is comfortable with.

[0]: http://pgtap.org

I had to work on/with a rather large backend which used this approach. Most of the logic was in stored procedures, with python/sqlalchemy used as a glue and to process requests.

It was absolutely horrible to develop and a nightmare to test.

For one, there's nothing approaching ActiveRecord migrations in stability and ease of use, so schema changes on dev/test/deploy were a huge pain. Then, you can't really work in normal "auto-run tests on save" mode as you would with saner frameworks, syncing source from git to postgres is needed (the principal dev altogether preferred to edit directly in pgadmin and then sync). Then, SQL tends to be a lot harder to read than python/ruby, especially so if not well documented and with many join tables. And finally, you can't scale out the heavy stuff, you are limited by your master instance basically. 2/5, would not recommend.

With the correct abstractions, you can definitely run tests on save.

SQL is rather readable these days. It's obviously not as easy to read as ruby or python and is slightly more verbose, but it is certainly readable.

Mileage will vary of course. There are cost/benefit trade offs with every architectural decision made.

> SQL is rather readable these days. It's obviously not as easy to read as ruby

As someone who can't read Ruby but can handle "any level" of SQL fluently, that's obviously not a very objective statement.

I made a very broad assumption based on my past experiences with engineers and I apologize if I offended you.

Most engineers I have met in my career have really depended on an ORM to do all of their heavy lifting. They can write some of the most eloquent ActiveRecord queries but would be unable to create the same functionality using raw SQL.

Again, deeply sorry if I offended. It was not my intention.

Wow, is this really the norm? I mean, not knowing SQL and working purely with ORM. This seems weird to me, as I learned SQL first.

Yes, that's definitely the norm. Having some experience with Rails, I'm pretty sure that 90% of Rails developers have never written any SQL. If they need to access the production database, most of them would use the rails console on the production app server instead of psql etc.

And I can sympathize. For example, I wouldn't even know how Rails maps a many-to-many relation to SQL tables, so I would have to use the rails console myself to traverse such a relation.

Although Rails is very reasonable with its mappings and they are easy to use directly, I prefer Rails for db manipulaton, it is usually much easier and faster to type 'User.find(1).favorites << Product.find(10)' than an equivalent SQL.

Very true, but I've found that it's easy for those ActiveRecord abstractions to kill performance (I did this on a dashboard that was performing some serious calculations across hundreds of millions of records, often joining other similarly sized tables)

I don't know if it's the norm or not, but leveraging an ORM without knowing SQL is a thing. ORMs have so many benefits (query parameterization just to name one). That and people can stay in their comfort zone language (Ruby, Python, Haskell, etc) and let the ORM do the heavy lifting instead of having to dip down into SQL.

Those that are really good with an ORM can generally go into SQL with ease.

The above is not meant to be objective, just my general observations.

To be clear I'm not against the use of ORMs.

They are a fine tool, but I find it weird that people are using them without knowing the underlying technology which they are based on.

And you also don't need ORMs to do query parameterization.

Also, think of the various boot camps (General Assembly, Launch Academy, etc). Those do not teach SQL, but do teach some ORM.

Seems to be, especially among younger developers that learned Rails, and even more so among boot-campers.

Yeah, well, not everything everyone says is objective.

>> With the correct abstractions, you can definitely run tests on save.

You definitely can, but you'll have to roll your own watcher and sync, there's no Guard equivalent that I'm aware of.

>> SQL is rather readable these days.

Used to think so as well. But then I had to read some pretty nasty sp's which calculated discounts depending on what user already has, and geolocation, and time of day, and day of month, and tons of other stuff, and the return value had to be a complex hierarchy of categories/products. So what would be 2 pages of python turns into 500 lines of SQL with some statements 80 lines long, plus it was still a pain to json-encode in python. The only positive thing about it was that I wasn't the one who had to write it.

> You definitely can, but you'll have to roll your own watcher and sync, there's no Guard equivalent that I'm aware of.

[shameless plug] here's mine :) https://github.com/oelmekki/pgrebase

EDIT: sqitch is also often mentioned in that field http://sqitch.org/

pgrebase seems very nice, thanks!

This sounds like the approach was taken but the necessariy development tools were not created. Hence the painful experience.

I don't quite agree it's the tools only, my bigger gripe was total lack of scalability. Basically, even for moderately complex calculations, you can cache a lot of slow-moving stuff (such as products, categories, etc) to redis and do most of stuff on web_worker_{nnn}, using potgres only where you need ACID guarantees (eg orders, payments, etc). When you do everything on your master instance, any heavy calculation can become a problem pretty quickly.

> When you do everything on your master instance, any heavy calculation can become a problem pretty quickly.

SELECT-only reporting code like this is normally run against a query slave, and with a bit of automation you can stand up one of those pretty much as easily as another webhead.

Then how is it better than simply using another webhead? In any case, we need to do heavy calculations AND transactionally update some state. It's possible to do this on slaves I guess with some trickery, but then what's the advantage over doing it in the app servers?

> Then how is it better than simply using another webhead?

Data locality.

> In any case, we need to do heavy calculations AND transactionally update some state.

In that case, I'd have the app server connect to both master and slave, run the heavy calculations on the slave the same way, and then send the updates to the master.

Well you didn't mention that part in the parent comment ;)

Cool demonstration of IMMUTABLE functions (TIL), but there's a reason people have moved away from stored procedures/functions in RDBMSs in favor of moving this stuff into much slower application code -- you now have business logic living inside of the database. From there, you have to now think about how this affects:

- Testing

- Deployment/Migrations

- Day-to-day tooling to work on it

- Switching costs (mental overhead, using multiple languages)

- Documentation

- Probably more stuff

The ruby code is just like all of your other ruby code and can be grokked and worked on by anyone who is up to speed on your ruby stack. The DB function meanwhile requires jumping through lots of hoops and will always feel like a second class citizen relative to the vast majority of your application code, unless you do a lot of work and maintenance (unlikely to happen without feeling distracting if this code is a relatively small slice of your code base.)

In some cases, this can be worth it for massive optimization wins. But the article doesn't really touch on why this might be a bad idea.

I do think it highlights the potential of systems like MADLIB [1] where you can push the data crunching into the database, while having that be abstracted away in your primary programming language (in this case, R.) One could imagine a world where you wrote the ruby code and it was somehow mapped in to be run on the database. But that's not the world we live in today unfortunately. It's really a shame, because people using a database as capable of PostgreSQL but with a fairly dumb ORM on top are under-utilizing its potential, for sure, but there's not really a great path afaik to leverage it. (I remember years ago MS introduced running C# inside of SQL Server, I am not sure if that ever ended up panning out into being something smarter than the dumb "upload code to database manually" model we are used it.)

[1] http://madlib.incubator.apache.org/

I may be in the minority here, and I'll happily accept that if I am. But my rule of thumb is that anything that can be done inside the database should be done inside the database.

But I'm a python/C#/.Net person.

The stored procs/functions go into a repo just like everything else does. It's not that big a deal. Alembic for migrations is pretty solid. And if you're dealing with a large code base, you log changes inside the DB itself. Use the database for version control.

It's really manageable. It requires that your dev team really knows SQL. But I don't think that's a bad thing.

It's fast, gets the job done, and if you're running a language that doesn't make it easy to deal with concurrency, it's nice to just let the database handle transactions.

For everyone who isn't facebook or google, this is the way to go, in my experience.

Let the db do the hard work, let your preferred language be the middleman, and have a reasonable front end framework do the presentation.

I can scale a web app to millions of users on a couple hundred bucks a month of hardware with this approach. You can too.

> my rule of thumb is that anything that can be done inside the database should be done inside the database

For non-distributable (or at least non-multi-master) DBs, you are often using your most expensive and least distributable resource to do your cheapest and most distributable work. My rule of thumb is that unless you need atomicity or the performance (as profiled) makes a real difference, do it outside the DB. Also has the side effect of avoiding system lock-in (at the expense of language lock-in which I think is ok).

Granted, it's subjective, but I've had the most trouble scaling DBs than other systems, and it's usually CPU/mem of business logic pushing the boundaries.

I don't often find myself in a situation where I don't need need atomicity for the work I do. So I'm speaking from a point of view that reflects that.

Even if I didn't need it for a project, I would probably start there because, you know, that's the hammer I have, so SQL is the nail I'm going to start with.

I haven't had problems scaling DBs though. DB throughput has never caused a problem for me. What has caused problems is business logic put in what I consider the wrong places.

Business logic belongs in the database structure wherever possible because that's the single source of truth for your data. That's where all of your controls for data integrity belong.

For any app that has longevity, you are going to have multiple methods of accessing the data. There will be APIs and crappy junk that wants your data. You have to account for this. There will be users logging directly into your database. Sometimes it will be your boss.

Start with an ACID approach to your CRUD app, and you probably won't go wrong. At least not soon. And you have options for scaling when you need to. You can scale vertically or horizontally or both.

> For non-distributable (or at least non-multi-master) DBs, you are often using your most expensive and least distributable resource to do your cheapest and most distributable work.

I see your logic, but I think this is misleading, and often the reason why people do inefficient things.

The pattern I see regularly is if the database does more thinking, its planner will diminish the amount of actual work; it will touch fewer rows from disk, and less data will result sent over the network and in fewer round trips. Overall for any extensive report you're likely to be paying significant cost to distribute the work, often more than actually doing the work.

Of course, as you say it's subjective. Scaling databases becomes necessary at a point, but it may be because of developers not embracing them; I think this is why NoSQL has gained in popularity.

Exactly, you really need ACID only for important stuff, other data you can cache on workers where you run most of the logic

If you're doing reports you could run them against a read-only slave.

> It's really manageable. It requires that your dev team really knows SQL

I'd question the credentials of any developer who doesn't know SQL. It's an essential skill across the entire IT-indutry. Not knowing it means you're obviously not serious about your job.

And here you're talking like programmers not knowing it is a normal thing... Are we working in the same industry?

Feel free to question my credentials, then. I'm 44, I've been a dev all my life, and I've never worked with SQL. I've worked with distributed in memory data grids, strange custom databases specific to trading systems, and many other data storage systems, but I've literally never written a single SQL query that has gone to test or production. I've also written tons of code that hasn't stored data, including my current project (an IDE).

There's a lot more to life than CRUD apps.

The chain of comments leading up to yours is interesting. One person said that if you've got a relational database you should push lots of work down to it, which they say requires lots of SQL knowledge, and they think having that knowledge is good. Then another person replied and said you're not a real programmer if you don't know SQL. Then you pointed out that it's entirely possible to be a developer and never touch relational databases.

I'd like to support your comment, and argue for a more restrictive claim. Namely, a developer is more valuable the better they know their tools. If they're working with a relational database, then it's good for them to learn more about how to use it. However, if you're running a team that is building some software system, you ought to make technology decisions which make sense given the skills of your team members; so, if the team's SQL knowledge is weak and there aren't compelling performance reasons to push lots of work into the database, it's probably not the right choice.

I appreciate your tone and like your comment (read: upvoted). But I want to respectfully disagree with it.

I don't agree with the part of your comment that says, essentially, work with the tools your team already has.

There are some cases where you have to choose the technology that makes the most sense and your team has to get with the program. If you're storing data in any way, you have what I think of as an obligation to choose the best technology to store it.

I know and have worked with people who would write everything to a csv file because they know how to do that and prefer to do that. And toss it on a server somewhere with no redundancy or backups of any kind.

We all come from different places and have different jobs. But some times you have to suck it up and learn a new thing to be effective in this industry.

Sometimes a CSV file actually is the right answer. Sometimes a database of some flavor is the right answer.

The point I was trying to make and didn't do such a good job of was that if you're dealing with a web app, a database is probably the right answer, and if you're using a database, you should maximize what it can do for you.

I should have made my comment more explicit. I'm talking about web development.

There are a ton of jobs which my comments don't apply to. And yours sounds like one of them.

There are are entire areas of my work that also have nothing to do with collecting or storing data, and I recognize that SQL has no place in that work.

I was speaking to the bulk of my work and assuming (perhaps wrongly) that we were talking about similar things.

My bad. And you are totally correct about life being more than CRUD apps.

To be clear, I was replying to josteink who said: It's an essential skill across the entire IT-indutry. Not knowing it means you're obviously not serious about your job. I thought your comment was totally reasonable :-)

Sometimes when you question you'll find that the person really has the chops :-)

I'm sure you'll agree however that for every one of you, there's ton of developers who are less programmers and more assembly line workers gluing framework pieces together :-)

There is a large span between "knows SQL" as in "can write and read basic SQL query" and "really knows SQL". And I bet there are tons of developers who never have had the need to touch SQL or even any RDBMS at all. It's an essential skill in parts of the industry, it has no practical relevance in others.

Upwork.com is a great way to fill that gap: tons of great SQL guys, very reasonable cost.

How much SQL does one have to know before they are "serious about their jobs," in your view?

You can write your stored procedures in a number of scripting languages. Out of the box PostgreSQL supports Tcl [1], Perl [2] and Python [3]. There's also a general mechanism to register other languages [4].

On github I found [5], which is embeds Ruby in PostgreSQL. I haven't used it and don't know how mature it is.

Using the same scripting language in your stored procedures and application code might lessen the impedance mismatch. It doesn't solve testing, deployment/migrations and tooling.

[1] https://www.postgresql.org/docs/current/static/pltcl.html

[2] https://www.postgresql.org/docs/current/static/plperl.html

[3] https://www.postgresql.org/docs/current/static/plpython.html

[4] https://www.postgresql.org/docs/9.6/static/plhandler.html

[5] https://github.com/knu/postgresql-plruby

That pl/ruby embedding is unmaintained for years and doesn't currently build.

To me the most interesting current extension language is javascript, via PL/v8: http://pgxn.org/dist/plv8/doc/plv8.html - now you can have isomorphic code at all three tiers of your web stack.

While you're right of course, I think you also just described why people have been switching to SOA (cough, I mean microservices).

If you encapsulate the database and related logic behind its own service interface, then as you get big enough to care about query performance, then only the team working on that microservice needs to care about how complex/tricky/whatever the optimizations become!

That certainly sounds right -- though I guess I still am very skeptical that the complexity tradeoff there is the right one (at least until you hit the scale that dictates it.)

And because you can't really report across microservices, everyone is now setting up datalakes, where they aggregate all the data of their services.

I believe that many people don't use stored procedures/functions because of lack of knowledge/know-how, difficulty or needing something done yesterday (which in itself is related to the previously), unfortunately.

I say this because I've seen it happen multiple times.

We are using Liquibase to manage the SQL code in the database as a separate GIT repository from the several applications that use the database. We have a dev database, and are beginning to use pgTAP for SQL unit tests. Code review for the SQL comes from BitBucket. GitHub works similarly for code review. The SQL code can be versioned and deployed as changesets either ins synchronization with an app change if that is appropriate, or separately. If you also embrace the concept of "database refactoring" then there is less need for migrations in lockstep with an app deploy.

So, putting some of the business logic into the database can be done using modern development best practices.

Yes there are switching costs and it means that developers need to really learn SQL and PostgreSQL's advanced features but it pays off. And if you break the lockstep of db migrations and application deployments then you can easily have people specialize by language.

In fact, I would say that the era of the DBA is fading and what we really need are devops developers with a strong understanding of PostgreSQL who can do db architecture, db engineering, db design, and SQL coding/refactoring.

Probably my perspective is different from a lot of people because I worked many years in large ISPs where Network Engineering folks designed everything, built it in test labs, verified everything, documented how to deploy and then handed it over to an Operations team that actually set up the routers and installed the configuration code. I don't see coding as the top of the pyramid. Instead it is Engineering Design and Architecture of applications that we should all aspire to.

Actually most of our enterprise customers prefer to let the database to the work.

Then again we tend to work with the likes of SQL Server, Oracle and similar.

I would be interested in seeing what your actual proposed solution to the example problem is here. You seem to be implying that we should pull back the full list of posts and order & limit in memory. This might make sense for some scales and workloads but it definitely wouldn't be my first or second solution for the given example.

today, lambda (speed layer + batch layer) or kappa architecture (kafka streams) -- ymmv. not sure if I would build that infra out just for this single problem, but generally once that infra is in place it can be applied to many problems including this one.

edit: (assuming i can't get it working with a jury rigged cronjob and the usual dumb solutions that will still use the ruby code. i'd try a bunch of those types of solutions first obv.)

You can run R code in SQL Server, Oracle, Teradata, Hana and Vertica.

All of these databases kick the crap out of HN darlings like Redshift and BigQuery. Special mention goes to Vertica which is free up to three nodes and/or 1TB data + 1 TB in flex tables. It is, like the rest, analytical AND transactional. Has proper admin features, proper edge case coverage and a truck load of features. It also has a very sensible licensing.

The article makes a good point that if you want ordering by a computed 'ranking' to be performant, your database will need to store and index that value.

But then it takes another leap and says that this means your database should actually compute that value. I don't see why that's necessary, and it would bring with it all sorts of issues that other people have mentioned (maintainability, scaling, testing, coupling).

It seems you could equally solve the problem by adding an additional indexed 'ranking' column, and computing and storing the value at the same time you insert the row [1]. It seems that's essentially what Postgres would do anyway.

Also, I'd note that the algorithm here is very simplistic, and even so, the author had to make a functional change in order to get it to be performant in Postgres. You can't just substitute an ID for a timestamp, just because both are monotonically increasing. The initial Ruby version of this algorithm treats a given 'popularity' as a fixed jump in some amount of time (e.g. 3 hours). The Postgres version treats it as a fixed jump in ranking (e.g. 3 ranks). Those are not equivalent.

[1] This does assume that you can control all the database manipulation from a centralized place.

> It seems that's essentially what Postgres would do anyway.

Yes, but if Postgres does it, it's literally impossible for the developer to screw up, whereas manually putting the ranking in the table can easily cause inconsistent records if you don't know what you're doing.

And even if you know what you're doing, next month's code change might not remember that it's important to update the ranking with the rest of the record (unless you're calculating it in a before_save hook or something).

At my last job we finally got our front-end devs to move all client side db work to calls to stored procedures or views. After doing so, nearly all of them fell completely in favor of this method. This is escpecially true with large, complex DBs or data warehouses, where the model is subject to change and reliance on back-end devs better knowledge of the database is more important.

That said, despite some of the tradeoffs - for example with testing, or migrating to a new db (which is extremely rare anyway) - it is ultimately far better for client code not to have to know back-end implementation. I think things like graphql are finally making this abundantly clear.

What I have been dying for is a real time DB BAAS that is ACID compliant, preferably relational, has a simple rest api, and allows me to write stored procedures that I can call from my client code. Horizon is probably the closest thing out there. Right now using firebase, and sick of the absurd amount of client side code I have to commmit to to pull data from various parts of the database. Requires huge amounts of overfetching, unecessary coupling of client-server code, horrible consistency support, overly loose type structuring, etc.

If somebody writes a postgres version of horizon I will pay big money to use it :)

Doesn't include everything that Horizon does, but it provides a good foundation for building it http://postgrest.com/

Cool, hadn't seen that. There is also postgraphql, which I have become a big fan of and is a great foundation:


That said, the real-time aspect is what I'm really dying for, plus a company supporting a BAAS. The ease of firebase is so nice, and you can make really cool real time apps with it very fast. I'm not sure if switching to a relational model would make impose some technical limitation that json stores don't (besides obvious complexity). Postgraphql + socket.io is kind of what i'm thinking is the start

Some clever use of NOTIFY might help with real-time? It's basically publish/subscribe directly in Postgres itself, but I've never actually used it.

I had forgotten about this actually. Thanks for the reminder :) Curious if anyone has tried using this for real time apps and what their experience was

Unfortunately NOTIFY doesn't work across nodes.

You might want to take a look at dreamfactory[1]. They have a nice app that wraps all manner of services to become a self-hosted MBaaS. I've used it to put a REST interface on my postgres database, views, and stores procs.

[1] - Dreamfactory.com

> where the model is subject to change

You cite this as the primary reason, but how much easier would this have been without stored procedures and especially views? I would think harder if anything.

It can be the primary reason, but it is generally one of a few very good reasons (performance is often another big one, as this article points out).

I will say its importance is somewhat contextual though. The 3 principal contextual factors that make this reason significant are: (1) A large, complex database/model, (2) A rapidly evolving data model (such as during development or prototyping and fast business requirement changes, and (3) a clear division between back-end and front-end guys on your team. Those are the most important factors, but they are not the only ones. When these factors are significant, you really want to encapsulate DB access from client work. The client should simply declaratively say what they want, and this enforced contract is maintained by the back-end team while the back-end team is free to implement the details, presumably better than the front-end team could. For example, if a query initially involves just joining a fact to a dimension, a front-end user might think they could simply write this in their ORM. But, if the model changes and this later requires 4 or 5 other joins to the get the needed data (because the modeler or business or whatever decided it to be so), then the client code can remain as is while the back-end team can rewrite the implementation details, maintaining performance and correctness of implementation as needed.

I tried doing this, but I found it hard to maintain as it needed more slaves to cope with the load of the complex functions. Also, it was harder to tweak the algorithm for things like "increment the score by x in case y, otherwise increment by z". I guess it's possible, but I just found it too hard to work with.

For me, it was easier to use Redis to keep a sorted set of article IDs. I updated the scores in my application code using Redis native functions, and then got a slice of IDs for a `SELECT ... WHERE IN (ids, from, redis)`.

It's probably because I'm a rubbish database engineer, so stick with my strengths.

> It's probably because I'm a rubbish database engineer, so stick with my strengths.

This doesn't make you a rubbish database engineer. Scaling databases is complex, so it makes sense to have your source of truth database do only the things it must do (consistent updates, indexing the ranking value if needed for queries) and things it's good at, and move everything else (calculating the ranking value) to another part of your system. If you can keep database load low enough to avoid sharding, that's a big win; you don't have to not grow your DB, just grow slower than Intel grows CPU performance / accessible RAM.

PostgreSQL Stored Procedures can absolutely be a pain to manage and change management can be tricky. Having said that these are not problems without solutions and there are some good wins if you can be organized enough to pull it off.

First off, plan and organize things. Yes, it would be nice if there were another level of abstraction, like Oracle's packages, but you can use schemas a bit more freely for namespacing. But a little time thinking instead of just coding can avoid much of the spaghetti. Many good devs will think through the rationality of their APIs, this is no different.

Second. Use a tool that lets you manage the code-bits like code. Migration tools like FlywayDB, etc. are not good for this because they sacrifice too much to the stateful nature of the database: you end up organizing things on a timeline rather than based on your structure. A better approach is like the one in Sqitch (http://sqitch.org) where you organize files according the organization of the database rather than the history of the database... very powerful approach. There are also verify steps which can incorporate testing: and testing the store procedures is easy... it's maging the data in tables across tests that can be hard and that is true regardless where your business logic lives (insofar as those tests can change data).

Third there are good testing tools available: The guy that made Sqitch also makes pgTAP (http://pgtap.org/). Well worth a look. I haven't done much myself with this one, but will likely do so soon.

I've written about this before [1] but i think the higher level message here is about Ruby on Rails developers expanding their database understanding and/or reliance. Given how cleanly ActiveRecord abstracts database porcelain, many devs who started with Rails don't lean on the database enough.

This, along with validators, table indexing and foreign keys are some of the more common shortcomings i see in devs who learned on Rails.

1: https://blog.benroux.me/be-the-better-rails-developer/

When you're starting out, you don't need these optimizations. Depending on what you're trying to do, you may never need them, and instead find that a caching solution is best.

Data integrity isn't an optimization.

What is the performance impact on INSERT when using tables with complex indices like the one mentioned in the article?

I'm not the author but I'd guess almost nothing. Basically you just need to calculate the result of the function and index it by the result, should be almost completely unnoticeable.

Unless it's a table with high write load. You're basically shifting the expensive part from read time to write time.

I'm maintaining a database cluster with 10s of millions of partial indexes, many of which contain a LIKE pattern. Each table in the cluster has several of these partial indexes. Because of all of these partial indexes, we became CPU bound. For the longest time, we thought all of the CPU was going to evaluating the LIKE patterns. Eventually, I decided to generate a flame graph of our database and discovered that we were only spending 0.3% of the time evaluating the partial index predicates. In turns out most of the CPU usage was evenly split between looking up the metadata of the index and in parsing the index predicate. So the index expression itself doesn't really matter, but having lots of these indexes can become an issue.

For version control, I put .sql files right alongside my other source code, and check them in. I might try snippets out interactively, but after a statement grows so long that it wraps a couple of times, I usually move it to a text file and edit from there. I run it in psql with the \i command:

   psql> \i test.sql
Usually I have one terminal window for the text file opened in vim, and another window for psql where all I do is keep running "\i test.sql".

When the code is polished and ready to go live, I turn it into a view or function and move it into a permanent file (views.sql, functions.sql, ddl.sql, make.sql, or whatever you want to call it). This file can be run as many times as I want, idempotently. This is because it starts off dropping all the views or functions that it later defines. You can try instead saying "CREATE OR REPLACE . . ." but PostgreSQL only lets you replace a view or function if its outer appearance stays the same (for views, the output columns can't change; for functions, the input and output can't change). So I find it simpler to just drop all of them ("DROP TABLE foo IF EXISTS...") and then create them (You usually have to drop them in the opposite order you create them, if any depend on one another). I wrap the whole file with BEGIN and COMMIT, so if there is any error the whole thing aborts. Actually I wrap it in BEGIN and ROLLBACK, and run it to see if there are any errors. Only when it's solid, polished, and error-free do I temporarily change ROLLBACK to COMMIT, run it, and immediately change it back to ROLLBACK. The file that gets checked into version control ends in ROLLBACK, for safety.

Later if I want to change something, I always do it first by editing the .sql file and then running it from psql. Then I check the changed file into version control.


For views and functions this is easy, because they don't keep their own data. You can drop and recreate them at any time. For tables, it's a little more complicated. You can't just drop and recreate them, or you'll lose all your data. So for tables I keep their definitions in a separate file. I run this file at an app's inception but seldom afterward. If I need to make a change to a table, I do it interactively, in the command-line tool ("ALTER TABLE foo ADD COLUMN blah ..."). Then I will probably edit the tables.sql file to reflect the change, just so it's in version control. But it's up to you. You can always pull the table definition from Postgres with the pg_dump command.

its often much easier to scale via app "farms" then to scale the db backend. Sometimes db procedures can lower the total db load, but be careful letting the db do too much work.

I thought the point of using an ORM like Active Record was to decouple the business logic and the database?

It can decouple you from the specific database implementation. The ORM can generate SQL that targets the database you are using by altering the generated queries to account for dialect and feature differences. For example, it can, potentially, make it possible to switch your application from using MySQL to PostgreSQL rather seamlessly. However, in my experience, this is rarely straightforward, and rarely done. It's not unusual for the application layer to become dependent on some quirk or unique feature of the underlying database, and sometimes that dependency is not obvious. The sibling reply is not far off the mark when it comes to using ORMs "in practice".


Ah that makes sense then. I guess I should have prefaced my comment with the fact that I have little Rails experience.

In one of my gigs, it was normal operating procedure to put logic in sprocs and call out to that from elsewhere. It makes so much sense - RDBMS engines are good at doing math, so we should let them do as much math as possible. ORM layers hide that away, and thinking about these problems in Ruby or Python work, but those languages just aren't as good at plain math as a database can be. I've done this type of thing in Rails and Express apps, and it makes complete sense there. You can even keep the sproc in source control by creating and changing them as migrations. It's tops.

"We can replace the timestamp with the primary key for the “recentness” calculation". Very different properties, not a great replacement.

I definitely agree that there may be some unexpected consequences of using the primary key in this way. It certainly won't behave like a timestamp and could distort the function outputs.

Would it be possible to substitute the integer count of seconds since epoch (or some other arbitrary base timestamp) and use that instead? Then you'd still have an immutable function to satisfy the index, but also you'd preserve valuable information about the relative time intervals between posts.

Yeah. If you wanted to delay making posts visible but create them in advance, they wouldn't have the same ranking as though they were just posted. You'd have to create them at the exact time you'd want them to start being displayed in order for the desired ranking to be correct.

The example was a social media site like Instagram, not a blog, which doesn't allow future posting. If you have different design constraints than the given example, obviously the implementation will need to adjusted to the new requirements.

No problem, just add a bot that creates a post every second, so that the pace of the primary key approximates that of the timestamp. /s

Care to expand on that?

>The integer value of a timestamp is just an increasing counter, ticking off each microsecond. The database itself has another form of increasing counter, the primary key.

This observation is either profound or trivial, depending on the mindset. The hundred-fold improvements it brings (along with the immutability) are not trivial, by any measure.

The idea is: Compute where your data is and yes data do have a location when computing has to be done with physics(computer).

Is it really needed to declare separate functions with all that boilerplate just to multiply and add a few numbers in an SQL statement?


what happens if i got 10 posts then before getting the second 10 one of the posts that i should get in the second query got 50 likes, it will got higher ranking and i will miss it, right?

The future is this but backed with super cool FDW for s3 and bigtable

he mentions 'logarithmic clamping' towards the end, can someone provide a link to what that is? google-fu failing.

I think ~0.442ms is very slow for a social media web or app, considering that it's only 1 query, maybe that example isn't the best for that kind of queries

It's unbelievable that an article can get to HN front page in which the author doesn't know the difference between Rails and Ruby. So annoying!

This is NOT Ruby for God's sake:

    SYSTEM_EPOCH   = 1.day.ago.to_i

People are downvoting I think because this is a rather pedantic point to be annoyed by. The tone is one of superiority, rather than pointing out the simple mistake.

Calling a rails function a ruby function is unlikely to invalidate the entirety of the article.

I know the point is somewhat pedantic, but I just don't really like to learn from people who doesn't even know what is what. Somewhat invalidates the article IMO, but I read it anyway, just with very carefully, trying not to learn anything just understanding the basic idea he/she want to tell.

I think they downvote because they are so ignorant I can't even believe it.

HN users will also down vote for tone and complaining about receiving down votes. Please strive to contribute constructively and in keeping with the guidelines.


Perhaps they are downvoting because of your own ignorance? The aforementioned code does depend on ActiveSupport (or similar library), but has absolutely nothing to do with Rails.

Thanks for explaining!

class Walkman < Troll

  def comment
    self.ignorance += 1

Pretty sure that's standard ruby.

Please be civil regardless of the behavior of other users.

That was more civil than outright calling multiple users ignorant.

Instead, I acknowledged demonstrated ignorance by cheekily posting a valid ruby class definition.

I am the original author, and I assure you I know the difference between Ruby and Rails. This was written over 3 years ago, and while I don't remember all of the details, I don't believe conflating Ruby level code with Rails specific code corrupts the article.

Really it's Ruby with the activesupport gem. Not a huge leap in simply calling it Ruby.

You are the only one cared to elaborate, thank you very much!

It most certainly is Ruby. It's Ruby with a library (ActiveSupport) added that augments with extra helper methods. When you write Ruby code that adds methods does it suddenly become not Ruby anymore?

It is Ruby in that it requires a Ruby interpreter to execute that line of code. It certainly isn't Javascript or Python. Is the article missing the context that these lines of code are being executed along side the Rails framework? Sure. Is it distracting from the main meaning of the article? I would say not.

For me it questions that the author is knowledgable enough to teach me any detail about Ruby.

That's fair, and definitely warrants skepticism, but was there anything in the article that you found counterfactual? I'm no RoR expert, but I do enjoy looking at how other people solved problems. Is there anything in the article that's counterfactual or that you've experienced causes headaches down the road?

I thought it was a truism that because of all the rampant monkeypatching, many ruby developers don't know how to use ruby without rails?

Surely if it's Rails then it must also, by definition, be Ruby?

Ok, now I know why! There are a lot of ignorant people here, wow!

This account has been posting many unsubstantive comments recently. Please stop and re-read the guidelines: https://news.ycombinator.com/newsguidelines.html. We have to ban accounts that continue on like this.

Speaking of this i'm looking at hiring someone to do some OSS rust/postgresql work. Namely writing FDW for s3 and bigtable.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact