At work, we have an internally generated report that was done purely in ruby. It worked, barely, munging numbers for about 1000 records. When we hit 4000 records, the report took about 6 hours to run in ruby.
We looked for some obvious optimizations like eager loading and some map reduce magic and implemented them. That brought us to a range of 45 minutes to 2 hours.
We decided that it was still not good enough. We moved ALL of the report into SQL functions (Postgres) and we have the whole function running in 27 seconds.
I agree with some of the top comments that maintaining this is a little trickier than ruby code, but here are some ideas we came up with for maintenance:
1) Write the SQL in a ruby file that has a class method of sorts that returns the generated function. This allows for version control to track the changes of the sql function.
2) Focus on tests at the interface level. We had great test coverage on the end result prior to moving to SQL. We were able to reuse those tests to verify the accuracy of the calculations that are now in SQL.
This really isn't for everyone, but for those that want to try something new, it is worth it.
I once reduced the running time of a report from 45 minutes to 3 seconds (900x improvement) by moving the code inside the database.
If a programming language wants to stay fast, it must eventually become a database. I realize this is an unpopular opinion, but popularity is the wrong metric to judge by.
Admittedly the code I was moving away from was hilariously bad. Threaded custom build of PHP bad. Then again I haven't tried to optimise the code I wrote at all
You are absolutely right about fast code becoming a database; this is simply down to the query planner - it can try to do the least possible work for the data you actually have.
I recently used temporary per-transaction tables (CREATE TEMPORARY TABLE .. ON COMMIT DROP, basically CTEs that persist across statements, and that can be indexed) with json_to_recordset and turned a three-minute ruby ETL background process into a sub-1-second inline call.
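For anyone curious, the shape of that (table names and payload invented for the example):

    BEGIN;

    -- Stage the incoming JSON into an indexed, transaction-scoped table.
    CREATE TEMPORARY TABLE staged_orders ON COMMIT DROP AS
    SELECT *
    FROM json_to_recordset('[{"id": 1, "total": 19.99}, {"id": 2, "total": 5.00}]'::json)
         AS t(id integer, total numeric);

    CREATE INDEX ON staged_orders (id);
    ANALYZE staged_orders;

    -- Later statements in the same transaction can join against it with index scans.
    UPDATE orders o
    SET    total = s.total
    FROM   staged_orders s
    WHERE  o.id = s.id;

    COMMIT;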
CREATE TEMP TABLE is really awesome. Not really related, but I used it at my previous gig to optimize the unit tests. They would previously all use the same database that devs use during development, so scans over some large tables were particularly inefficient, and tests could break when someone modified table contents for some development task.
I implemented a small function where you could list the tables that you were working with, and it would create empty temporary tables with the same name in a namespace that takes precedence over the normal namespace in postgres' search_path, therefore giving you a blank slate to work with for the unit test, while the other tables were still populated with useful data. (Some of these data come from expensive infrastructure scanning jobs, so starting from an entirely empty database was not an option.)
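A minimal sketch of the trick, assuming a hypothetical public.users table (temp tables live in the pg_temp schema, which is searched before public by default):

    -- Shadow the real table with an empty temp copy for the duration of the session.
    CREATE TEMPORARY TABLE users (LIKE public.users INCLUDING ALL);

    -- Unqualified references now hit the blank temp table...
    SELECT count(*) FROM users;          -- 0

    -- ...while the original data is still reachable when schema-qualified.
    SELECT count(*) FROM public.users;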
Just curious - not sure if you were using Ruby like the parent above. I literally just built a library to back ActiveRecord models with temporary tables. It's very useful since you can then use AR scopes off of your temporary table (I had to do this instead of using CTEs because my DB currently doesn't support CTEs).
Just thought I'd share it in case it helps, or to hear if anyone else has a better approach to this. (warning: it's very experimental and not production tested yet). I couldn't find any other gem that would do this.
How did you resist the temptation to call it temptation?
Although I think temptable is almost the opposite of contemptible, therefore also good going in the pun department.
My code is all Ruby, and I ended up pushing all the work into SQL and just eventually selecting with find_by_sql into existing models for the results. There is possibly concurrent invocation vs updates and a race against itself, so it's also all wrapped with a serializable transaction and starts with a mutex lock on the controlling model.
Hah! I was sitting around trying to think of a clever name, but then I got tired of sitting around and just went with temptable. Had I thought of temptation I would have gone with that.
The approach I went with works really well for cases when you want to persist the temporary table through the end of the request (it works well for master/index-type views that may have a lot of sums/counts and filter options available on them).
One shower idea I had is the concept of a Clojure variant that, instead of compiling to JVM bytecode, JavaScript or CLR, compiled to PostgreSQL. I think that would be awesome: you could just run the same functions, seamlessly, from the database, through the web server, to the browser. And, whilst of course you need to know about the limitations of the database, it could be great for pushing code to the database seamlessly.
Unfortunately I don't think I have the skills for that, so posting here in the hopes that someone that can likes the idea :)
This is kind of what LINQ does. You write a query in C#. The structure of the query (like an abstract syntax tree) can be exposed to the query provider which can interpret it or compile it to the CLR or compile it to SQL or whatever.
A sufficiently advanced LINQ (http://wiki.c2.com/?SufficientlySmartCompiler) would do wonders in some cases, but I haven't encountered it. AFAIK, LINQ to SQL only knows about joins and some aggregate functions.
One thing that makes improving it cumbersome is that the semantics of many operations are slightly different in the database than in C#. For example, SQL Server doesn't do NaNs and infinities, supports a zillion string collations, and its date types may not map 100% to C# ones.
Also, databases may run stored procedures faster than the SQL that LINQ generates on the fly because they can cache their query plans (most databases will detect that an on-the-fly query is equal to one run recently, though, so this may be somewhat of a moot argument).
At work we've got a LINQ query that gets passed around a few functions, eventually getting into a core of 60 lines of LINQ logic. A colleague verified that chaining selects produces different output but runs at the same speed (chain selects in order to somewhat declare variables: Select(x => new { x, y = <query-x> }), then you can Select(xy => new { use xy.y multiple times })).
Sometimes I think I should just be using sql.. (which we do on other projects)
Yeah, I've been pretty happy without ORM in node.js, I even wrote a semi-nice wrapper so I could turn template strings into parameterized queries. Made writing a bunch of migration scripts a cakewalk.
Sometimes it's really just easier to write SQL directly.
I'm not too familiar with ZODB, but it looks like it tries to impose OO on a database, while in reality the relational model works best with data, so making the programming language able to interact with data that way would be better. I think something like JOOQ[1] is close to that.
In order to get good performance you want to minimize the number of back-and-forth requests over the network. So instead of making a request to obtain a list of items and then fetching each of the items one by one (the so-called N+1 select issue), you will get better performance if you make the database send only the data you want, nothing more, nothing less.
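For example, with hypothetical orders/order_lines tables, one aggregate query replaces the 1 + N per-order round trips:

    -- One round trip: each order comes back with its total already computed.
    SELECT o.id,
           o.customer_id,
           sum(l.quantity * l.unit_price) AS order_total
    FROM   orders o
    JOIN   order_lines l ON l.order_id = o.id
    WHERE  o.created_at >= now() - interval '30 days'
    GROUP  BY o.id, o.customer_id;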
> I'm not too familiar with ZODB, but it looks like it tries to impose OO on a database, while in reality the relational model works best with data, so making the programming language able to interact with data that way would be better.
ZODB itself is essentially a transactional single-object store, where that "single object" is usually the root node of an arbitrary (possibly cyclic) object graph. Storage depends on the backend, nowadays you'd usually use file storage for development and relstorage for any kind of deployment.
It doesn't have a query language of its own; Python is the query language. There are no indices beyond those that you explicitly create (eg. a dictionary would be a simple index, but other packages provide more complex indices, like B-Trees - the nice thing here is that anything that goes into the DB is fully transacted, which removes a lot of headache that you'd have with other solutions (eg. flat XML files)).
ZODB thrives when your data model is too complex to easily or efficiently map into SQL, when you don't do or need efficient complex ad-hoc queries, and when raw performance and concurrency (although the latter was recently improved through the MVCC reimplementation) aren't the highest priority. Since it's pretty good at caching (much better than most ORMs) performance usually doesn't suck.
More like a database toolkit. Performance depends on application but was clearly never the top priority. (Can still be pretty impressive in comparison due to the rather effective instance caching)
Yep, the K interpreter is pretty mediocre (it is very small and the language is inherently pretty fast, but the interpreter is otherwise not very advanced¹) but being directly integrated with a very optimized in-memory column-store database means it smokes anything else. Even the massive engineering effort in a JVM+RDBMS simply can't compete with running K directly on memory-mapped files.
1. I believe they added SIMD support a year or two ago and got a massive (2-10× IIRC) speedup.
> This allows for version control to track the changes of the sql function.
You... realize you could just put the SQL itself in git, right? Not saying there aren't potential benefits, but "it can go in version control" isn't one of them.
Absolutely! We use rails. I just mentioned doing it in that method as opposed to throwing it in a migration. That migration shouldn't change in version control. We wanted a history of the changes and an easy way to access it in migrations.
I suppose you could also execute "CREATE OR REPLACE FUNCTION" and redefine the function inside a new migration each time you want to change it. The downside is that you end up with the whole function in a new migration every time you want to change it. But if you're not updating your stored procedures very frequently (which is usually the case, IME), it probably wouldn't be too bad.
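The SQL such a migration would execute might look like this (function and table invented for illustration); CREATE OR REPLACE swaps the body in place as long as the signature stays the same:

    CREATE OR REPLACE FUNCTION report_totals(p_since date)
    RETURNS TABLE (account_id integer, total numeric)
    LANGUAGE sql STABLE AS $$
        SELECT l.account_id, sum(l.amount)
        FROM   ledger_entries l
        WHERE  l.entry_date >= p_since
        GROUP  BY l.account_id;
    $$;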
At work we use a rather idiosyncratic tool called RoundhousE to deploy our SQL. It treats SQL labelled as a stored procedure differently, and will run the file on every deploy, as opposed to scripts that alter the schema, which are only run once and may not be modified.
You don't need a tool for this. If you store base data in one schema, and functions/views in another, you can update this second schema willy nilly without worrying about damaging your data.
Just start the second file with DROP SCHEMA IF EXISTS schemaname CASCADE; CREATE SCHEMA schemaname;. Then, make sure any functions/views/permissions built on top of this schema are also contained within this file.
You are right, but a Rails migration would provide a way to deploy to the database. Far from ideal though, because you could have many migrations with that SQL code, one for each version of the code. Not sure what the best solution is in this case.
I'm not a database guy, but I have to ask: you're talking about 4000 records, which is essentially nothing.
Why not just load all the data from SQL into a DataFrame in Python or R and do the processing there? I've written some pretty big R DataFrame crunching scripts, including very wide-window moving averages operating on tens of billions of records, that run in less than five minutes on my laptop.
Hell, I have a web app built with Flask+Pandas, running on a single t2.small instance, that does complex queries into a chemistry data set of 6000 records based on user input of selectors. It can serve thousands of simultaneous users without noticeable slowdown.
4000 records with a one-to-many join (the many part capped at 240 records). The aggregations are done mostly on that join table.
In our original ruby process, it would return just that, 240 records per join and aggregate on that. Our SQL process aggregates that in a stored procedure and returns the flat 4000 records.
We really didn't want to introduce another piece of technology to this stack, hence we didn't look at R or Python or Elixir. We did consider storing all of this inside elasticsearch for a hot second, but keeping things in sync at the correct times seemed suboptimal and would add a layer of complexity we were all uncomfortable with.
ONE join? A join is usually just a linear scan through one table, doing hashtable lookups on the other table. If there is no hash table, at the worst both tables have to be sorted on the join key and scanned linearly.
The absolute worst case of a "join" is a cartesian product which runs in quadratic time. 6000^2 = 36000000, which should still easily finish in ~1 second.
You can install R or Python into a PostgreSQL server as a language extension and then use it to define SQL functions. This is really useful to define functions which can reduce the size of the recordset transmitted across the network.
You wouldn't likely put the whole app to run on the database server, but typically PostgreSQL makes good use of the available RAM and disk, but leaves lots of CPU cores idle. So running some R and Python on the PostgreSQL server is an overall improvement in performance.
P.S. Yes, I know that PostgreSQL 9.6.1 will use multiple cores for aggregate and join queries on large tables, but few people have upgraded to that yet.
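A rough sketch of what that looks like, assuming PL/Python is available on the server (plpython3u here; older installs ship plpythonu) and a hypothetical requests table:

    CREATE EXTENSION IF NOT EXISTS plpython3u;

    -- An aggregation helper that runs next to the data, so only the reduced
    -- result crosses the network.
    CREATE OR REPLACE FUNCTION median(vals double precision[])
    RETURNS double precision
    LANGUAGE plpython3u IMMUTABLE AS $$
        import statistics
        return statistics.median(vals) if vals else None
    $$;

    SELECT region, median(array_agg(response_ms)) AS median_ms
    FROM   requests
    GROUP  BY region;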
My immediate guess would be that they already knew SQL and had the data in there - and both R and pandas have a tricky initial learning curve if you're not familiar with dataframe style operations (hell, I was a pure math major some years ago and I have to contort my brain to do dataframes whereas I long since got comfortable with SQL).
Oh absolutely. I've tried doing complicated logic with (MySQL) stored procedures. It's painful. Definitely send the queries from a sane programming language.
One way to think about this is that you just want to take control back from the ORM of the exact database queries sent down the pipe. Which, BTW, is why every ORM worth its salt has functions for dropping down to plain SQL - teach that as best practice for doing computation in SQL queries, not stored procedures.
Most ORMs will not do the level of aggregations you want or need without writing raw SQL within your ORM. If you are already writing raw SQL in your ORM for aggregation level computations, you might as well create a stored procedure, make it IMMUTABLE (if you can) and gain those performance benefits.
At the end of the day, I would just use the tool the team on the project is most comfortable with. You can obviously start with an ORM, but there may come a point in time where performance needs a little boost.
What are the real performance benefits of stored procedures over sending the hand-written queries over the wire through your ORM? I suspect small to none, and at the cost of having code that lives in the database rather than in version-controlled code.
> code that lives in the database rather than in version-controlled code
If you think you can't version control stored procs, then you are doing something badly, badly wrong.
What's the difference between a "script" which is a text file on a filesystem at the end of the day, versus a "stored proc" which is a block of text in a database, loaded off a file in a filesystem?
And then you have to make sure your deploys are atomic so that the app code and stored procedures don't get out of sync on failed deploys...
Or your stored procedures now become one more thing that has to be backwards compatible, in addition to your DB schema...
I've run into all of these problems with stored procedures, and they SUCK. Good deployment tooling can solve them, but it's a lot of effort for questionable reward.
It's a small amount of effort for a thousand-fold increase in performance and not having to re-implement the same functionality in every app and in every language that connects to that DB. YMMV.
The argument was between using SQL in stored procedures or SQL in queries sent by the application, so you don't have to re-implement anything (just share the SQL queries between apps), and I very much doubt you'll get a thousand-fold increase in performance.
One performance benefit of stored procedures in Postgres is that you can perform a multi-statement transaction in the stored procedure, saving the network overhead of multiple round-trips between the application and the database. This may not be a common occurrence for many applications, but it is a legitimate performance benefit when the need arises.
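A sketch of the idea with made-up tables - one function call instead of three statements crossing the network, and the statements still execute atomically:

    CREATE OR REPLACE FUNCTION place_order(p_customer integer, p_product integer, p_qty integer)
    RETURNS integer
    LANGUAGE plpgsql AS $$
    DECLARE
        v_order_id integer;
    BEGIN
        INSERT INTO orders (customer_id) VALUES (p_customer) RETURNING id INTO v_order_id;
        INSERT INTO order_lines (order_id, product_id, quantity) VALUES (v_order_id, p_product, p_qty);
        UPDATE products SET stock = stock - p_qty WHERE id = p_product;
        RETURN v_order_id;
    END;
    $$;

    SELECT place_order(42, 7, 3);   -- one round trip from the application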
Note that the network protocol allows pipelining for multiple statements (including transaction control ones), too. Unfortunately only few client libraries (e.g. pgjdbc) make use of that :(
You can literally do version control in the database as well. Upgrade and downgrade your entire schema at will. Unfortunately the only open source software I'm aware of that does this is Archiveopteryx.
Honestly, it's yak shaving at that point. Like the article pointed out, you can create an index on an expression that calls a function marked IMMUTABLE. That will have some effect on the performance of the query.
There are other patterns you can use to track via version control the changes for your stored procedure.
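A minimal sketch of that expression-index pattern (table and function invented for the example):

    CREATE OR REPLACE FUNCTION normalized_email(email text)
    RETURNS text
    LANGUAGE sql IMMUTABLE AS $$
        SELECT lower(trim(email));
    $$;

    -- Only IMMUTABLE functions can be used in an index expression.
    CREATE INDEX users_normalized_email_idx ON users (normalized_email(email));

    -- This predicate can now be answered from the index.
    SELECT * FROM users WHERE normalized_email(email) = 'someone@example.com';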
Always pure SQL, at least as long as you're using a DBMS. Never anything else.
SQL can stretch to full Turing completeness, so it'll always be good enough for that. And between JSON support and PLPGSQL most everything can be done in Postgres.
Unfortunately, "pure sql" usually means passing in strings, which leaves you without good tooling to manage the code. This can greatly slow down the speed of delivery. It's great if you are in an ossified section of the code/database, but it can be very brittle (or expensive to validate) otherwise.
Modern RDBMSes can handle most of the ORM use cases at their end. What's wrong with VIEWs, for instance?
Plus it solves the whole schema impedance mismatch thing to just leave it up to the DB.
ORMs aren't that great. Having spent the past five years primarily with an object database that is also fully SQL (ISC Caché) I find myself preferring SQL more and more. Even when you have the ORM within the bloody database I'm going SQL.
The string thing isn't an SQL problem, it's a tooling one.
The only tool per se should be pgAdmin or an equivalent. And in your application the only "SQL" that should be written is SELECT * FROM viewforthisfunctionality and select runupsertwithjsonobjectthatpgorplpgsqlcanreadjustfine(json)
Have you considered a hybrid approach, like doing the heavy lifting/querying/calculating in Postgres, but the final formatting of the report (something that might change often, or be very finicky to implement in pl/sql or in the db) is left to ruby?
That's exactly what we did. We return an anonymous table of records in Postgres that are then passed to ruby to render an excel file. The SQL query takes 27s on average. The ruby view generation is fast as well but I haven't profiled that part of the report.
> That's exactly what we did. We return an anonymous table of records in Postgres that are then passed to ruby to render an excel file. The SQL query takes 27s on average. The ruby view generation is fast as well but I haven't profiled that part of the report.
What's an "anonymous table of records"? Do you mean it's just a result set (i.e. query)?
Yea. It returns a table of rows, where the return columns are defined in the stored procedure. I may have used the wrong nomenclature. I was trying to differentiate between a table in a database schema versus one defined in a function.
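For reference, a minimal sketch of such a function with invented names - the RETURNS TABLE clause defines the output columns without any matching table existing in the schema:

    CREATE OR REPLACE FUNCTION report_rows(p_year integer)
    RETURNS TABLE (record_id integer, item_count bigint, grand_total numeric)
    LANGUAGE sql STABLE AS $$
        SELECT r.id, count(li.id), coalesce(sum(li.amount), 0)
        FROM   records r
        LEFT JOIN line_items li ON li.record_id = r.id
        WHERE  r.year = p_year
        GROUP  BY r.id;
    $$;

    -- From Rails: Record.find_by_sql("SELECT * FROM report_rows(2016)")
    SELECT * FROM report_rows(2016);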
I took 20 pages of multi-threaded ruby code that took 2-4 days to complete and changed it to a 1-page MySQL stored procedure with some @variables. Brought it down to 4 minutes to run. And to top it off, it reduced the overall server load from 96% to 20%.
I kept that CPU usage chart and added it to my CV as it was quite impressive to look at.
Correct usage of ORMs, if the logic isn't in the database, is another important and easy thing to get right. This often gives a 10-1000 times performance improvement just due to the avoided "ping pong" game, e.g. by using query annotations that tell the ORM which related objects you'll access.
We didn't make it immutable and we didn't add an index to it. Those are our next action items to help improve the performance of it.
Honestly, instead of looking into more performance gains, I wanted the team to get a huge win and for the team to look good in front of senior management :D.
If you want to optimize for looking good in front of senior management, you would put a sleep() in there to make it run for 5 minutes. Then when senior mgmt comes back and complains that 5 minutes is still too much, you remove the sleep(), bill a few hours to time tracking, and praise the new performance improvements that you just deployed. ;)
This is an example of what I call "embracing PostgreSQL". Roughly speaking it means ditching the ORM layers and using a simple API to run any SQL query that you can think up. Then your application can leverage the many features that exist in PostgreSQL and not just treat it as a dumb SQL server.
Using functions as an intermediary to raw queries, also decouples the SQL details from the application. What I mean is that if the application calls an SQL function to read, update, insert data then you can revise the SQL functions independently of application code. In other words, if you refactor an SQL function, you can deploy the change to the database without rebuilding and deploying the application.
This is especially helpful in companies that have an onerous verification and validation process for application changes, because often "performance improvements" are classified as administrative changes which can be done with less bureaucracy.
So, embrace PostgreSQL as a permanent part of your application and reap the benefits of its many powerful features.
Our application uses JSON throughout, and I've found the JSON functions in postgres to be invaluable!
I recently refactored several functions which kind of organically grew from several different functions, each with its own SQL query, into one big query.
Not only did performance increase by a significant amount, the whole function is basically just "run this query and return the result" as the JSON functions in postgres allow me to leave the actual processing in the database and to not have to do the always annoying "loop over the 30,000 rows that were returned to build a JSON file" thing which is never fast.
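Roughly this pattern, with made-up table names - the whole document comes back as a single value:

    SELECT json_build_object(
             'generated_at', now(),
             'orders', json_agg(
                 json_build_object(
                   'id',    o.id,
                   'total', o.total,
                   'lines', (SELECT json_agg(json_build_object('sku', l.sku, 'qty', l.quantity))
                             FROM order_lines l
                             WHERE l.order_id = o.id)
                 )
             )
           ) AS payload
    FROM   orders o
    WHERE  o.created_at >= current_date - 7;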
Because they want a RESTful endpoint, not JSON-over-SQL. Would you open that up as a public API with free registration? How do you do access control using OAuth? How would you do stuff like convert images to thumbnails, extract text from PDFs or send an email?
Both of those are adding a layer over Postgres; they are still "something like Django Rest Framework". Whether it's written in Python, Haskell or Go is not really the point I was making.
There's a lot of substance to the stack chosen. The performance Haskell or Go offer over Python, let alone Django, a fat framework, makes them far better picks for thin wrappers around an RDBMS.
As someone who's thought about embracing over using the ORM, here's a couple issue that come up:
- Version control. The SQL functions end up just being data in your DB. This feels like an anti-feature in the world of continuous integration and code review.
- Lack of composability. An ORM gives you the ability to easily compose filters. This lets me write functions like "take this query, and make sure it's properly filtered down to one user's data". Because of how SQL works, each query needs to be hand-crafted, and it's hard to make parts of it generic.
- Rewriting business logic. I have business logic in Python already, but if I want to query off of that in the DB in SQL, now I need two implementations. You can sort of get around this in an ORM by annotations.
- I'm not sure what the developer tooling environment is like. PyCharm is pretttty nice.
To be honest, you citing that it's easy to get around the validation process when using Postgres is a major red flag. Why would I want to circumvent code review for things that are undoubtedly code?
The right way to have composability in a RDBMS is with things like parameterized views or table-valued functions. This would fall under embracing the RDBMS, leveraging its modern capabilities.
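A small sketch of the table-valued-function flavour, assuming a hypothetical orders table:

    CREATE OR REPLACE FUNCTION user_orders(p_user_id integer)
    RETURNS SETOF orders
    LANGUAGE sql STABLE AS $$
        SELECT * FROM orders WHERE user_id = p_user_id;
    $$;

    -- Callers compose on top of it, much like chaining an ORM scope.
    SELECT *
    FROM   user_orders(42)
    WHERE  status = 'open'
    ORDER  BY created_at DESC
    LIMIT  20;

Because it's a simple LANGUAGE sql function, Postgres can usually inline it into the calling query, so the outer filters and ordering can still use indexes on the underlying table.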
> Version control. The SQL functions end up just being data in your DB. This feels like an anti-feature in the world of continuous integration and code review.

Functions can be treated the same way as migrations. With PostgreSQL you have transactional DDL, so you can atomically deploy and roll back functions. There are plenty of script packages to help manage this. At the end of the day it's deploying and rolling back SQL scripts in text files, which are easily managed with version control.

> Lack of composability. An ORM gives you the ability to easily compose filters. This lets me write functions like "take this query, and make sure it's properly filtered down to one user's data". Because of how SQL works, each query needs to be hand-crafted, and it's hard to make parts of it generic.

The trade-off here is that you're trusting the ORM to produce an optimal query (or queries) for whatever you're trying to do, compared with writing the SQL yourself, with knowledge of your schema. Like any trade-off, it's one you need to decide for yourself.

You may have filtering on the data you're returning from the database that you chose not to do in the query itself. At that point, you should just be dealing with data structures. I don't think this would be different whether you're using an ORM or not.

> Rewriting business logic. I have business logic in Python already, but if I want to query off of that in the DB in SQL, now I need two implementations. You can sort of get around this in an ORM by annotations.

I'm not sure what you're getting at here, so I'm not sure if this addresses your concerns: if you've got clean separation of your code, you should only need to update a single location for any given piece of logic. Single source of truth, and all that.

> I'm not sure what the developer tooling environment is like. PyCharm is pretttty nice.

SQL scripts are generally text files, so anything that supports some kind of SQL syntax highlighting is likely all you need. I suspect PyCharm will work. The JetBrains folks are pretty sharp.

> Why would I want to circumvent code review for things that are undoubtedly code?

Since you can have your scripts under version control, you can likely use the same code review tools that you currently do. You can also use libraries such as pgtap[0] to test your schema and functions independent of the rest of your application, if you so choose.
These are all issues I've dealt with in a large production PostgreSQL environment. They're good things to think about, and you're right to consider them. At the end, you need to make a decision you and your team is comfortable with.
I had to work on/with a rather large backend which used this approach. Most of the logic was in stored procedures, with python/sqlalchemy used as a glue and to process requests.
It was absolutely horrible to develop and a nightmare to test.
For one, there's nothing approaching ActiveRecord migrations in stability and ease of use, so schema changes on dev/test/deploy were a huge pain. Then, you can't really work in the normal "auto-run tests on save" mode as you would with saner frameworks; syncing source from git to postgres is needed (the principal dev altogether preferred to edit directly in pgadmin and then sync). Then, SQL tends to be a lot harder to read than python/ruby, especially so if not well documented and with many join tables. And finally, you can't scale out the heavy stuff; you are limited by your master instance basically. 2/5, would not recommend.
I made a very broad assumption based on my past experiences with engineers and I apologize if I offended you.
Most engineers I have met in my career have really depended on an ORM to do all of their heavy lifting. They can write some of the most eloquent ActiveRecord queries but would be unable to create the same functionality using raw SQL.
Again, deeply sorry if I offended. It was not my intention.
Yes, that's definitely the norm. Having some experience with Rails, I'm pretty sure that 90% of Rails developers have never written any SQL. If they need to access the production database, most of them would use the rails console on the production app server instead of psql etc.
And I can sympathize. For example, I wouldn't even know how Rails maps a many-to-many relation to SQL tables, so I would have to use the rails console myself to traverse such a relation.
Although Rails is very reasonable with its mappings and they are easy to use directly, I prefer Rails for db manipulation; it is usually much easier and faster to type 'User.find(1).favorites << Product.find(10)' than the equivalent SQL.
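For comparison, roughly the SQL Rails ends up emitting for that one-liner, assuming a has_and_belongs_to_many backed by a favorites join table:

    SELECT * FROM users    WHERE id = 1  LIMIT 1;
    SELECT * FROM products WHERE id = 10 LIMIT 1;
    INSERT INTO favorites (user_id, product_id) VALUES (1, 10);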
Very true, but I've found that it's easy for those ActiveRecord abstractions to kill performance (I did this on a dashboard that was performing some serious calculations across hundreds of millions of records, often joining other similarly sized tables)
I don't know if it's the norm or not, but leveraging an ORM without knowing SQL is a thing. ORMs have so many benefits (query parameterization just to name one). That and people can stay in their comfort zone language (Ruby, Python, Haskell, etc) and let the ORM do the heavy lifting instead of having to dip down into SQL.
Those that are really good with an ORM can generally go into SQL with ease.
The above is not meant to be objective, just my general observations.
>> With the correct abstractions, you can definitely run tests on save.
You definitely can, but you'll have to roll your own watcher and sync, there's no Guard equivalent that I'm aware of.
>> SQL is rather readable these days.
Used to think so as well. But then I had to read some pretty nasty sp's which calculated discounts depending on what user already has, and geolocation, and time of day, and day of month, and tons of other stuff, and the return value had to be a complex hierarchy of categories/products. So what would be 2 pages of python turns into 500 lines of SQL with some statements 80 lines long, plus it was still a pain to json-encode in python. The only positive thing about it was that I wasn't the one who had to write it.
I don't quite agree it's the tools only; my bigger gripe was the total lack of scalability. Basically, even for moderately complex calculations, you can cache a lot of slow-moving stuff (such as products, categories, etc) to redis and do most of the work on web_worker_{nnn}, using postgres only where you need ACID guarantees (eg orders, payments, etc). When you do everything on your master instance, any heavy calculation can become a problem pretty quickly.
> When you do everything on your master instance, any heavy calculation can become a problem pretty quickly.
SELECT-only reporting code like this is normally run against a query slave, and with a bit of automation you can stand up one of those pretty much as easily as another webhead.
Then how is it better than simply using another webhead? In any case, we need to do heavy calculations AND transactionally update some state. It's possible to do this on slaves I guess with some trickery, but then what's the advantage over doing it in the app servers?
> Then how is it better than simply using another webhead?
Data locality.
> In any case, we need to do heavy calculations AND transactionally update some state.
In that case, I'd have the app server connect to both master and slave, run the heavy calculations on the slave the same way, and then send the updates to the master.
Cool demonstration of IMMUTABLE functions (TIL), but there's a reason people have moved away from stored procedures/functions in RDBMSs in favor of moving this stuff into much slower application code -- you now have business logic living inside of the database. From there, you have to now think about how this affects:
- Testing
- Deployment/Migrations
- Day-to-day tooling to work on it
- Switching costs (mental overhead, using multiple languages)
- Documentation
- Probably more stuff
The ruby code is just like all of your other ruby code and can be grokked and worked on by anyone who is up to speed on your ruby stack. The DB function meanwhile requires jumping through lots of hoops and will always feel like a second class citizen relative to the vast majority of your application code, unless you do a lot of work and maintenance (unlikely to happen without feeling distracting if this code is a relatively small slice of your code base.)
In some cases, this can be worth it for massive optimization wins. But the article doesn't really touch on why this might be a bad idea.
I do think it highlights the potential of systems like MADLIB [1] where you can push the data crunching into the database, while having that be abstracted away in your primary programming language (in this case, R.) One could imagine a world where you wrote the ruby code and it was somehow mapped in to be run on the database. But that's not the world we live in today unfortunately. It's really a shame, because people using a database as capable as PostgreSQL but with a fairly dumb ORM on top are under-utilizing its potential, for sure, but there's not really a great path afaik to leverage it. (I remember years ago MS introduced running C# inside of SQL Server; I am not sure if that ever ended up panning out into being something smarter than the dumb "upload code to database manually" model we are used to.)
I may be in the minority here, and I'll happily accept that if I am. But my rule of thumb is that anything that can be done inside the database should be done inside the database.
But I'm a python/C#/.Net person.
The stored procs/functions go into a repo just like everything else does. It's not that big a deal. Alembic for migrations is pretty solid. And if you're dealing with a large code base, you log changes inside the DB itself. Use the database for version control.
It's really manageable. It requires that your dev team really knows SQL. But I don't think that's a bad thing.
It's fast, gets the job done, and if you're running a language that doesn't make it easy to deal with concurrency, it's nice to just let the database handle transactions.
For everyone who isn't facebook or google, this is the way to go, in my experience.
Let the db do the hard work, let your preferred language be the middleman, and have a reasonable front end framework do the presentation.
I can scale a web app to millions of users on a couple hundred bucks a month of hardware with this approach. You can too.
> my rule of thumb is that anything that can be done inside the database should be done inside the database
For non-distributable (or at least non-multi-master) DBs, you are often using your most expensive and least distributable resource to do your cheapest and most distributable work. My rule of thumb is that unless you need atomicity or the performance (as profiled) makes a real difference, do it outside the DB. Also has the side effect of avoiding system lock-in (at the expense of language lock-in which I think is ok).
Granted, it's subjective, but I've had more trouble scaling DBs than other systems, and it's usually CPU/mem of business logic pushing the boundaries.
I don't often find myself in a situation where I don't need atomicity for the work I do. So I'm speaking from a point of view that reflects that.
Even if I didn't need it for a project, I would probably start there because, you know, that's the hammer I have, so SQL is the nail I'm going to start with.
I haven't had problems scaling DBs though. DB throughput has never caused a problem for me. What has caused problems is business logic put in what I consider the wrong places.
Business logic belongs in the database structure wherever possible because that's the single source of truth for your data. That's where all of your controls for data integrity belong.
For any app that has longevity, you are going to have multiple methods of accessing the data. There will be APIs and crappy junk that wants your data. You have to account for this. There will be users logging directly into your database. Sometimes it will be your boss.
Start with an ACID approach to your CRUD app, and you probably won't go wrong. At least not soon. And you have options for scaling when you need to. You can scale vertically or horizontally or both.
> For non-distributable (or at least non-multi-master) DBs, you are often using your most expensive and least distributable resource to do your cheapest and most distributable work.
I see your logic, but I think this is misleading, and often the reason why people do inefficient things.
The pattern I see regularly is that if the database does more thinking, its planner will diminish the amount of actual work; it will touch fewer rows on disk, and less data will end up sent over the network, in fewer round trips. Overall, for any extensive report you're likely to be paying a significant cost to distribute the work, often more than actually doing the work.
Of course, as you say it's subjective. Scaling databases becomes necessary at a point, but it may be because of developers not embracing them; I think this is why NoSQL has gained in popularity.
> It's really manageable. It requires that your dev team really knows SQL
I'd question the credentials of any developer who doesn't know SQL. It's an essential skill across the entire IT industry. Not knowing it means you're obviously not serious about your job.
And here you're talking like programmers not knowing it is a normal thing... Are we working in the same industry?
Feel free to question my credentials, then. I'm 44, I've been a dev all my life, and I've never worked with SQL. I've worked with distributed in memory data grids, strange custom databases specific to trading systems, and many other data storage systems, but I've literally never written a single SQL query that has gone to test or production. I've also written tons of code that hasn't stored data, including my current project (an IDE).
The chain of comments leading up to yours is interesting. One person said that if you've got a relational database you should push lots of work down to it, which they say requires lots of SQL knowledge, and they think having that knowledge is good. Then another person replied and said you're not a real programmer if you don't know SQL. Then you pointed out that it's entirely possible to be a developer and never touch relational databases.
I'd like to support your comment, and argue for a more restrictive claim. Namely, a developer is more valuable the better they know their tools. If they're working with a relational database, then it's good for them to learn more about how to use it. However, if you're running a team that is building some software system, you ought to make technology decisions which make sense given the skills of your team members; so, if the team's SQL knowledge is weak and there aren't compelling performance reasons to push lots of work into the database, it's probably not the right choice.
I appreciate your tone and like your comment (read: upvoted). But I want to respectfully disagree with it.
I don't agree with the part of your comment that says, essentially, work with the tools your team already has.
There are some cases where you have to choose the technology that makes the most sense and your team has to get with the program. If you're storing data in any way, you have what I think of as an obligation to choose the best technology to store it.
I know and have worked with people who would write everything to a csv file because they know how to do that and prefer to do that. And toss it on a server somewhere with no redundancy or backups of any kind.
We all come from different places and have different jobs. But some times you have to suck it up and learn a new thing to be effective in this industry.
Sometimes a CSV file actually is the right answer. Sometimes a database of some flavor is the right answer.
The point I was trying to make and didn't do such a good job of was that if you're dealing with a web app, a database is probably the right answer, and if you're using a database, you should maximize what it can do for you.
I should have made my comment more explicit. I'm talking about web development.
There are a ton of jobs which my comments don't apply to. And yours sounds like one of them.
There are are entire areas of my work that also have nothing to do with collecting or storing data, and I recognize that SQL has no place in that work.
I was speaking to the bulk of my work and assuming (perhaps wrongly) that we were talking about similar things.
My bad. And you are totally correct about life being more than CRUD apps.
To be clear, I was replying to josteink who said: It's an essential skill across the entire IT industry. Not knowing it means you're obviously not serious about your job. I thought your comment was totally reasonable :-)
Sometimes when you question you'll find that the person really has the chops :-)
I'm sure you'll agree however that for every one of you, there's ton of developers who are less programmers and more assembly line workers gluing framework pieces together :-)
There is a large span between "knows SQL" as in "can write and read basic SQL query" and "really knows SQL". And I bet there are tons of developers who never have had the need to touch SQL or even any RDBMS at all. It's an essential skill in parts of the industry, it has no practical relevance in others.
You can write your stored procedures in a number of scripting languages. Out of the box PostgreSQL supports Tcl [1], Perl [2] and Python [3]. There's also a general mechanism to register other languages [4].
On github I found [5], which embeds Ruby in PostgreSQL. I haven't used it and don't know how mature it is.
Using the same scripting language in your stored procedures and application code might lessen the impedance mismatch. It doesn't solve testing, deployment/migrations and tooling.
That pl/ruby embedding is unmaintained for years and doesn't currently build.
To me the most interesting current extension language is javascript, via PL/v8: http://pgxn.org/dist/plv8/doc/plv8.html - now you can have isomorphic code at all three tiers of your web stack.
While you're right of course, I think you also just described why people have been switching to SOA (cough, I mean microservices).
If you encapsulate the database and related logic behind its own service interface, then as you get big enough to care about query performance, then only the team working on that microservice needs to care about how complex/tricky/whatever the optimizations become!
That certainly sounds right -- though I guess I still am very skeptical that the complexity tradeoff there is the right one (at least until you hit the scale that dictates it.)
I believe that many people don't use stored procedures/functions because of lack of knowledge/know-how, difficulty, or needing something done yesterday (which in itself is related to the previous reasons), unfortunately.
I say this because I've seen it happen multiple times.
We are using Liquibase to manage the SQL code in the database as a separate Git repository from the several applications that use the database. We have a dev database, and are beginning to use pgTAP for SQL unit tests. Code review for the SQL comes from BitBucket. GitHub works similarly for code review. The SQL code can be versioned and deployed as changesets, either in synchronization with an app change if that is appropriate, or separately. If you also embrace the concept of "database refactoring" then there is less need for migrations in lockstep with an app deploy.
So, putting some of the business logic into the database can be done using modern development best practices.
Yes there are switching costs and it means that developers need to really learn SQL and PostgreSQL's advanced features but it pays off. And if you break the lockstep of db migrations and application deployments then you can easily have people specialize by language.
In fact, I would say that the era of the DBA is fading and what we really need are devops developers with a strong understanding of PostgreSQL who can do db architecture, db engineering, db design, and SQL coding/refactoring.
Probably my perspective is different from a lot of people because I worked many years in large ISPs where Network Engineering folks designed everything, built it in test labs, verified everything, documented how to deploy and then handed it over to an Operations team that actually set up the routers and installed the configuration code. I don't see coding as the top of the pyramid. Instead it is Engineering Design and Architecture of applications that we should all aspire to.
I would be interested in seeing what your actual proposed solution to the example problem is here. You seem to be implying that we should pull back the full list of posts and order & limit in memory. This might make sense for some scales and workloads but it definitely wouldn't be my first or second solution for the given example.
Today: lambda architecture (speed layer + batch layer) or kappa architecture (Kafka Streams) -- YMMV. Not sure if I would build that infra out just for this single problem, but generally once that infra is in place it can be applied to many problems, including this one.
Edit: (assuming I can't get it working with a jury-rigged cronjob and the usual dumb solutions that would still use the ruby code. I'd try a bunch of those types of solutions first, obv.)
You can run R code in SQL Server, Oracle, Teradata, Hana and Vertica.
All of these databases kick the crap out of HN darlings like Redshift and BigQuery. Special mention goes to Vertica which is free up to three nodes and/or 1TB data + 1 TB in flex tables. It is, like the rest, analytical AND transactional. Has proper admin features, proper edge case coverage and a truck load of features. It also has a very sensible licensing.
At my last job we finally got our front-end devs to move all client-side db work to calls to stored procedures or views. After doing so, nearly all of them fell completely in favor of this method. This is especially true with large, complex DBs or data warehouses, where the model is subject to change and reliance on the back-end devs' better knowledge of the database is more important.
That said, despite some of the tradeoffs - for example with testing, or migrating to a new db (which is extremely rare anyway) - it is ultimately far better for client code not to have to know back-end implementation. I think things like graphql are finally making this abundantly clear.
What I have been dying for is a real-time DB BAAS that is ACID compliant, preferably relational, has a simple REST API, and allows me to write stored procedures that I can call from my client code. Horizon is probably the closest thing out there. Right now using firebase, and sick of the absurd amount of client-side code I have to commit to to pull data from various parts of the database. Requires huge amounts of overfetching, unnecessary coupling of client-server code, horrible consistency support, overly loose type structuring, etc.
If somebody writes a postgres version of horizon I will pay big money to use it :)
That said, the real-time aspect is what I'm really dying for, plus a company supporting a BAAS. The ease of firebase is so nice, and you can make really cool real-time apps with it very fast. I'm not sure if switching to a relational model would impose some technical limitation that json stores don't (besides obvious complexity). Postgraphql + socket.io is kind of what I'm thinking is the start.
I had forgotten about this actually. Thanks for the reminder :)
Curious if anyone has tried using this for real time apps and what their experience was
You might want to take a look at dreamfactory[1]. They have a nice app that wraps all manner of services to become a self-hosted MBaaS. I've used it to put a REST interface on my postgres database, views, and stored procs.
You cite this as the primary reason, but how much easier would this have been without stored procedures and especially views? I would think harder if anything.
It can be the primary reason, but it is generally one of a few very good reasons (performance is often another big one, as this article points out).
I will say its importance is somewhat contextual though. The three principal contextual factors that make this reason significant are: (1) a large, complex database/model, (2) a rapidly evolving data model (such as during development or prototyping, and fast business requirement changes), and (3) a clear division between back-end and front-end guys on your team. Those are the most important factors, but they are not the only ones.

When these factors are significant, you really want to encapsulate DB access away from client work. The client should simply declaratively say what they want, and this enforced contract is maintained by the back-end team, while the back-end team is free to implement the details, presumably better than the front-end team could. For example, if a query initially involves just joining a fact to a dimension, a front-end user might think they could simply write this in their ORM. But if the model changes and this later requires 4 or 5 other joins to get the needed data (because the modeler or business or whatever decided it to be so), then the client code can remain as is while the back-end team rewrites the implementation details, maintaining performance and correctness as needed.
The article makes a good point that if you want ordering by a computed 'ranking' to be performant, your database will need to store and index that value.
But then it takes another leap and says that this means your database should actually compute that value. I don't see why that's necessary, and it would bring with it all sorts of issues that other people have mentioned (maintainability, scaling, testing, coupling).
It seems you could equally solve the problem by adding an additional indexed 'ranking' column, and computing and storing the value at the same time you insert the row [1]. It seems that's essentially what Postgres would do anyway.
Also, I'd note that the algorithm here is very simplistic, and even so, the author had to make a functional change in order to get it to be performant in Postgres. You can't just substitute an ID for a timestamp, just because both are monotonically increasing. The initial Ruby version of this algorithm treats a given 'popularity' as a fixed jump in some amount of time (e.g. 3 hours). The Postgres version treats it as a fixed jump in ranking (e.g. 3 ranks). Those are not equivalent.
[1] This does assume that you can control all the database manipulation from a centralized place.
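A sketch of that alternative - store the ranking at write time and index the plain column (table and formula are illustrative, using the 3-hours-per-popularity-point example):

    ALTER TABLE posts ADD COLUMN ranking bigint;

    CREATE OR REPLACE FUNCTION set_post_ranking() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        NEW.ranking := extract(epoch FROM NEW.created_at)::bigint
                       + NEW.popularity * 3 * 3600;   -- 3 hours per popularity point
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER posts_set_ranking
    BEFORE INSERT OR UPDATE OF popularity ON posts
    FOR EACH ROW EXECUTE PROCEDURE set_post_ranking();

    CREATE INDEX posts_ranking_idx ON posts (ranking DESC);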
> It seems that's essentially what Postgres would do anyway.
Yes, but if Postgres does it, it's literally impossible for the developer to screw up, whereas manually putting the ranking in the table can easily cause inconsistent records if you don't know what you're doing.
And even if you know what you're doing, next month's code change might not remember that it's important to update the ranking with the rest of the record (unless you're calculating it in a before_save hook or something).
I tried doing this, but I found it hard to maintain as it needed more slaves to cope with the load of the complex functions. Also, it was harder to tweak the algorithm for things like "increment the score by x in case y, otherwise increment by z". I guess it's possible, but I just found it too hard to work with.
For me, it was easier to use Redis to keep a sorted set of article IDs. I updated the scores in my application code using Redis native functions, and then got a slice of IDs for a `SELECT ... WHERE IN (ids, from, redis)`.
It's probably because I'm a rubbish database engineer, so stick with my strengths.
> It's probably because I'm a rubbish database engineer, so stick with my strengths.
This doesn't make you a rubbish database engineer. Scaling databases is complex, so it makes sense to have your source of truth database do only the things it must do (consistent updates, indexing the ranking value if needed for queries) and things it's good at, and move everything else (calculating the ranking value) to another part of your system. If you can keep database load low enough to avoid sharding, that's a big win; you don't have to not grow your DB, just grow slower than Intel grows CPU performance / accessible RAM.
PostgreSQL Stored Procedures can absolutely be a pain to manage and change management can be tricky. Having said that these are not problems without solutions and there are some good wins if you can be organized enough to pull it off.
First off, plan and organize things. Yes, it would be nice if there were another level of abstraction, like Oracle's packages, but you can use schemas a bit more freely for namespacing. But a little time thinking instead of just coding can avoid much of the spaghetti. Many good devs will think through the rationality of their APIs, this is no different.
Second. Use a tool that lets you manage the code-bits like code. Migration tools like FlywayDB, etc. are not good for this because they sacrifice too much to the stateful nature of the database: you end up organizing things on a timeline rather than based on your structure. A better approach is the one in Sqitch (http://sqitch.org), where you organize files according to the organization of the database rather than the history of the database... very powerful approach. There are also verify steps which can incorporate testing: and testing the stored procedures is easy... it's managing the data in tables across tests that can be hard, and that is true regardless of where your business logic lives (insofar as those tests can change data).
Third there are good testing tools available: The guy that made Sqitch also makes pgTAP (http://pgtap.org/). Well worth a look. I haven't done much myself with this one, but will likely do so soon.
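For a flavour of pgTAP, here is a minimal test file (assuming the extension is installed; report_totals is just a stand-in function):

    BEGIN;
    SELECT plan(2);

    SELECT has_function('public', 'report_totals', ARRAY['date']);

    SELECT results_eq(
        $$ SELECT count(*) FROM report_totals('2016-01-01') $$,
        $$ VALUES (0::bigint) $$,
        'report is empty for a date with no data'
    );

    SELECT * FROM finish();
    ROLLBACK;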
I've written about this before [1] but I think the higher level message here is about Ruby on Rails developers expanding their database understanding and/or reliance. Given how cleanly ActiveRecord abstracts database porcelain, many devs who started with Rails don't lean on the database enough.
This, along with validators, table indexing and foreign keys, is among the more common shortcomings I see in devs who learned on Rails.
When you're starting out, you don't need these optimizations. Depending on what you're trying to do, you may never need them, and instead find that a caching solution is best.
I'm not the author but I'd guess almost nothing. Basically you just need to calculate the result of the function and index it by the result, should be almost completely unnoticeable.
I'm maintaining a database cluster with 10s of millions of partial indexes, many of which contain a LIKE pattern. Each table in the cluster has several of these partial indexes. Because of all of these partial indexes, we became CPU bound. For the longest time, we thought all of the CPU was going to evaluating the LIKE patterns. Eventually, I decided to generate a flame graph of our database and discovered that we were only spending 0.3% of the time evaluating the partial index predicates. It turns out most of the CPU usage was evenly split between looking up the metadata of the index and parsing the index predicate. So the index expression itself doesn't really matter, but having lots of these indexes can become an issue.
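For anyone who hasn't used them, a partial index of that shape looks like this (table and pattern made up):

    CREATE INDEX events_recent_errors_idx
    ON events (occurred_at)
    WHERE message LIKE 'ERROR:%' AND archived = false;

    -- Only queries whose WHERE clause implies the index predicate can use it.
    SELECT *
    FROM   events
    WHERE  message LIKE 'ERROR:%' AND archived = false
    ORDER  BY occurred_at DESC
    LIMIT  50;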
For version control, I put .sql files right alongside my other source code, and check them in. I might try snippets out interactively, but after a statement grows so long that it wraps a couple of times, I usually move it to a text file and edit from there. I run it in psql with the \i command:
psql> \i test.sql
Usually I have one terminal window for the text file opened in vim, and another window for psql where all I do is keep running "\i test.sql".
When the code is polished and ready to go live, I turn it into a view or function and move it into a permanent file (views.sql, functions.sql, ddl.sql, make.sql, or whatever you want to call it). This file can be run as many times as I want, idempotently. This is because it starts off dropping all the views or functions that it later defines. You can try instead saying "CREATE OR REPLACE . . ." but PostgreSQL only lets you replace a view or function if its outer appearance stays the same (for views, the output columns can't change; for functions, the input and output can't change). So I find it simpler to just drop all of them ("DROP VIEW IF EXISTS foo ...") and then create them. (You usually have to drop them in the opposite order you create them, if any depend on one another.)

I wrap the whole file with BEGIN and COMMIT, so if there is any error the whole thing aborts. Actually I wrap it in BEGIN and ROLLBACK, and run it to see if there are any errors. Only when it's solid, polished, and error-free do I temporarily change ROLLBACK to COMMIT, run it, and immediately change it back to ROLLBACK. The file that gets checked into version control ends in ROLLBACK, for safety.
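The skeleton of such a file, with placeholder objects:

    BEGIN;

    -- Drop in reverse dependency order: the view uses the function.
    DROP VIEW IF EXISTS order_summaries;
    DROP FUNCTION IF EXISTS order_total(integer);

    CREATE FUNCTION order_total(p_order_id integer) RETURNS numeric
    LANGUAGE sql STABLE AS $$
        SELECT coalesce(sum(quantity * unit_price), 0)
        FROM order_lines WHERE order_id = p_order_id;
    $$;

    CREATE VIEW order_summaries AS
    SELECT o.id, o.customer_id, order_total(o.id) AS total
    FROM orders o;

    -- Flip to COMMIT only when running it for real.
    ROLLBACK;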
Later if I want to change something, I always do it first by editing the .sql file and then running it from psql. Then I check the changed file into version control.
---
For views and functions this is easy, because they don't keep their own data. You can drop and recreate them at any time. For tables, it's a little more complicated. You can't just drop and recreate them, or you'll lose all your data. So for tables I keep their definitions in a separate file. I run this file at an app's inception but seldom afterward. If I need to make a change to a table, I do it interactively, in the command-line tool ("ALTER TABLE foo ADD COLUMN blah ..."). Then I will probably edit the tables.sql file to reflect the change, just so it's in version control. But it's up to you. You can always pull the table definition from Postgres with the pg_dump command.
It's often much easier to scale via app "farms" than to scale the db backend. Sometimes db procedures can lower the total db load, but be careful about letting the db do too much work.
It can decouple you from the specific database implementation. The ORM can generate SQL that targets the database you are using by altering the generated queries to account for dialect and feature differences. For example, it can, potentially, make it possible to switch your application from using MySQL to PostgreSQL rather seamlessly. However, in my experience, this is rarely straightforward, and rarely done. It's not unusual for the application layer to become dependent on some quirk or unique feature of the underlying database, and sometimes that dependency is not obvious. The sibling reply is not far off the mark when it comes to using ORMs "in practice".
In one of my gigs, it was normal operating procedure to put logic in sprocs and call out to that from elsewhere. It makes so much sense: RDBMS engines are good at doing math, so we should let them do as much math as possible. ORM layers hide that away, and while thinking about these problems in Ruby or Python works, those languages just aren't as good at plain math as a database is. I've done this type of thing in Rails and Express apps, and it makes complete sense there. You can even keep the sproc in source control by creating and changing them as migrations. It's tops.
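To make "let the database do the math" concrete, here's a hedged sketch (the table, columns, and function are all invented) of the kind of aggregate you'd put in a function and call from the app, instead of looping over rows in Ruby:

    -- Set-based work stays inside Postgres; the app makes one round trip.
    CREATE OR REPLACE FUNCTION monthly_revenue(yr int, mon int)
    RETURNS TABLE (customer_id int, revenue numeric) AS $$
        SELECT o.customer_id, sum(o.amount)
        FROM orders o
        WHERE date_part('year', o.created_at) = yr
          AND date_part('month', o.created_at) = mon
        GROUP BY o.customer_id;
    $$ LANGUAGE sql STABLE;

    -- Called from the app (or a migration-managed sproc wrapper):
    SELECT * FROM monthly_revenue(2024, 6);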
I definitely agree that there may be some unexpected consequences of using the primary key in this way. It certainly won't behave like a timestamp and could distort the function outputs.
Would it be possible to substitute the integer count of seconds since epoch (or some other arbitrary base timestamp) and use that instead? Then you'd still have an immutable function to satisfy the index, but also you'd preserve valuable information about the relative time intervals between posts.
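That could look something like this rough sketch (the column names, the backfill, and the scoring formula are all invented; only the shape of the idea matters):

    -- Store seconds-since-epoch as a plain integer on each row
    -- (backfilled here from a hypothetical created_at column).
    ALTER TABLE posts ADD COLUMN created_epoch bigint;
    UPDATE posts SET created_epoch = extract(epoch FROM created_at)::bigint;

    -- The ranking function only touches its arguments, so it can be IMMUTABLE
    -- and therefore usable in an index expression.
    CREATE FUNCTION hotness(likes integer, created_epoch bigint)
    RETURNS numeric AS $$
        SELECT likes + created_epoch / 45000.0;
    $$ LANGUAGE sql IMMUTABLE;

    CREATE INDEX index_posts_on_hotness_epoch
        ON posts (hotness(likes_count, created_epoch) DESC);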
Yeah. If you wanted to delay making posts visible but create them in advance, they wouldn't have the same ranking as though they were just posted. You'd have to create them at the exact time you'd want them to start being displayed in order for the desired ranking to be correct.
The example was a social media site like Instagram, which doesn't allow future posting, not a blog. If you have different design constraints than the given example, obviously the implementation will need to be adjusted to the new requirements.
>The integer value of a timestamp is just an increasing counter, ticking off each microsecond. The database itself has another form of increasing counter, the primary key.
This observation is either profound or trivial, depending on the mindset. The hundred-fold improvements it brings (along with the immutability) are not trivial, by any measure.
What happens if I get 10 posts, and then, before fetching the second 10, one of the posts that I should get in the second query gets 50 likes? It will get a higher ranking and I will miss it, right?
I think ~0.442 ms is very slow for a social media site or app, considering that it's only one query; maybe that example isn't the best for that kind of query.
People are downvoting I think because this is a rather pedantic point to be annoyed by. The tone is one of superiority, rather than pointing out the simple mistake.
Calling a rails function a ruby function is unlikely to invalidate the entirety of the article.
I know the point is somewhat pedantic, but I just don't really like to learn from people who don't even know what is what. It somewhat invalidates the article IMO, but I read it anyway, just very carefully, trying not to learn anything beyond the basic idea he/she wants to convey.
I think they downvote because they are so ignorant I can't even believe it.
HN users will also down vote for tone and complaining about receiving down votes. Please strive to contribute constructively and in keeping with the guidelines.
Perhaps they are downvoting because of your own ignorance? The aforementioned code does depend on ActiveSupport (or similar library), but has absolutely nothing to do with Rails.
I am the original author, and I assure you I know the difference between Ruby and Rails. This was written over 3 years ago, and while I don't remember all of the details, I don't believe conflating Ruby level code with Rails specific code corrupts the article.
It most certainly is Ruby. It's Ruby with a library (ActiveSupport) added that augments with extra helper methods. When you write Ruby code that adds methods does it suddenly become not Ruby anymore?
It is Ruby in that it requires a Ruby interpreter to execute that line of code. It certainly isn't JavaScript or Python. Is the article missing the context that these lines of code are being executed alongside the Rails framework? Sure. Does it distract from the main meaning of the article? I would say not.
That's fair, and it definitely warrants skepticism, but was there anything in the article that you found factually wrong, or that you've experienced causing headaches down the road? I'm no RoR expert, but I do enjoy looking at how other people have solved problems.
This account has been posting many unsubstantive comments recently. Please stop and re-read the guidelines: https://news.ycombinator.com/newsguidelines.html. We have to ban accounts that continue on like this.