Effectively Using Materialized Views in Ruby on Rails (pganalyze.com)
158 points by moritzplassnig on Jan 16, 2020 | 36 comments



Overall, I love this approach, but the big pain for me is the lack of incremental view refreshes. I end up needing to recreate giant tables each refresh even though the underlying changes are small. Sure I could implement that myself, but that sacrifices the correctness guarantees that materialized views provide!

Two things that would help:

1. Getting Incremental View Maintenance (IVM) [1] into Postgres. It looks like work is beginning on this but it's been 7+ years coming. If there are any Postgres devs looking at this, I'm cheering for you!

2. There's a commercial group doing an implementation of this, Materialize [2], but they don't have any publicly released product yet. I mention it because their interview on the Data Engineering Podcast is really good [3], and I like their focus on correctness by working from replication logs.

[1] https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...

[2] https://materialize.io/

[3] https://www.dataengineeringpodcast.com/materialize-streaming...


Likely too boring and old-school data warehousing, but this is what the notion of rollup/aggregate tables achieves. You define a table that's the result of some sort of aggregation query (time is almost always involved: minutely, hourly, daily, monthly, plus other business-specific dimensions), and at the conclusion of each corresponding interval you insert the next batch.

    e.g. insert into rollup_clicks_hourly select date_trunc('hour', created_at) as hour, count(*) as clicks from clicks where created_at >= ? and created_at < ? group by 1
Then from a reporting perspective, users/dashboards/etc read from the highest relevant table that meets their criteria.
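
A minimal sketch of the surrounding pieces, with illustrative table and column names:

    create table rollup_clicks_hourly (
      hour timestamptz primary key,
      clicks bigint not null
    );

    -- dashboards read the pre-aggregated table instead of scanning raw clicks:
    select hour, clicks
    from rollup_clicks_hourly
    where hour >= now() - interval '7 days'
    order by hour;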


I worked with a company that was doing basic analytics on their live ERP db: sales and margin reporting, etc. There was a view written to join in all of the common tables (things like product names, customer names and addresses) so the in-house people could query it easily. When performance fell off they materialised the view as a stopgap... until they got around to a data warehouse. 10 years later it is still there, building each night! It was actually the perfect solution for them: simple analytics with no extra infrastructure!


That's a very clever concept.

Could you recommend any classic old-school books about data warehousing that you've read that teach more such techniques?

Thank you!



Yup, exactly. Kimball's stuff is the best. You can achieve on a modest machine what modern techniques would need an incredibly expensive, horizontally scaled MPP database to do. It does require a lot more planning and forethought, to be certain.

https://www.amazon.com/dp/product/1118530802 and possibly https://www.amazon.com/dp/0764567578/


You can do effectively the same thing using a physical table and triggers [1] if you want a pure Postgres solution. I ran into the same problem and just used a physical table, implementing the updating logic at our app layer. Not exactly fun or easy to implement all the updating logic yourself though...

[1] https://hashrocket.com/blog/posts/materialized-view-strategi...
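
For illustration, a minimal sketch of the general trigger approach, with hypothetical table names (note this only handles inserts; updates and deletes would each need their own trigger logic, which is exactly where the "not fun" part comes in):

    create table click_counts (
      hour timestamptz primary key,
      clicks bigint not null default 0
    );

    create or replace function bump_click_count() returns trigger as $$
    begin
      insert into click_counts (hour, clicks)
      values (date_trunc('hour', new.created_at), 1)
      on conflict (hour) do update set clicks = click_counts.clicks + 1;
      return new;
    end;
    $$ language plpgsql;

    -- EXECUTE FUNCTION requires Postgres 11+; use EXECUTE PROCEDURE on older versions
    create trigger clicks_rollup
      after insert on clicks
      for each row execute function bump_click_count();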


I'm very hopeful for approaches that model this need from the ground up. Big Data distributed stream processing frameworks like Spark or Flink allow for standing SQL queries that are incrementally built, and you can even subscribe to react to the results.

There are other initiatives for plugging this into existing RDBMSs, like Noria [1] for MySQL. It allows for subscribing to changes, and also for lazy evaluation of the materialized view rows, keeping them in cache for later.

[1] https://github.com/mit-pdos/noria/blob/master/README.md


Like you said, not easy: recreating the functionality of views using triggers is error-prone and a heavy maintenance burden, and the triggers have to be implemented differently depending on what each query looks like... In a lot of cases it's better to just optimize the query and indexing.

In PG you can make a lot of queries run straight from one index if you tailor an index to a query, since you can index your own plpgsql functions.
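
For example (illustrative names), an expression index over an immutable plpgsql function lets a matching query be answered straight from the index:

    create function normalized_email(addr text) returns text as $$
    begin
      return lower(trim(addr));
    end;
    $$ language plpgsql immutable;  -- expression indexes require IMMUTABLE

    create index users_normalized_email_idx on users (normalized_email(email));

    -- this predicate matches the indexed expression, so it can use the index:
    select id from users where normalized_email(email) = 'foo@example.com';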


Materialized views are fundamentally just some syntactic sugar over the results of a query cached in a table.

The syntactic sugar does have advantages, but it comes at the cost of flexibility: if you need more control over the update process, or modification of the data beyond a full refresh, then you should just use a table (create table whatever as select * from ...).
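
A rough sketch of what that flexibility buys you (names illustrative): with a plain table you can patch just the affected slice instead of rebuilding everything:

    create table report_cache as
      select account_id, sum(amount) as total from payments group by 1;

    -- full refresh:
    truncate report_cache;
    insert into report_cache
      select account_id, sum(amount) as total from payments group by 1;

    -- or a targeted update, which REFRESH MATERIALIZED VIEW can't do:
    delete from report_cache where account_id = 42;
    insert into report_cache
      select account_id, sum(amount) as total from payments
      where account_id = 42 group by 1;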


It also seems odd to me that the refresh is scheduled in the app instead of the database.

Can anybody here speak to that choice?


Yeah. Coming from a Django world it's about avoiding surprises for application programmers. I suspect it's the same in Rails. Historically Django has almost entirely had logic at the application level rather than pushing it into the DB: scheduled jobs, data validation, trigger-ish logic (via `post_save` signals and such).

Part of the power of OP's technique is that it looks exactly like a standard Django/Rails data model with a tiny sprinkle of magic -- no surprises. It's surprising for a Django programmer to hear "all periodic tasks are handled via celery, except these table refreshes" or "signals are responsible for all work in response to model changes, except these triggers".

Obviously in some cases you need a trigger for correctness, but in general I try to stick to the conventions of the ecosystem.


Do you mean using something like `pg_cron` instead of calling it through Scenic and a cron fired at the app level?

I personally don't like having my cron jobs distributed across different areas. All my crons are in one place/system, like easycron.com or setcronjob.com or the whenever gem. Performance is not a consideration when deciding whether to run the job directly from the DB or the app.

I answered your question directly, but maybe that's not what you were asking?


Yeah, pg_cron, then. Updating a materialized view feels like a data integrity concern that should be addressed within the DB itself, if not in the DDL for the view itself.


I used a technique very much like this in the past. Here are two potential gotchas using matviews with Rails, both related to the inevitable reality that schemas evolve over time:

- Postgres supports concurrently refreshing the contents of existing materialized views, but there's no built-in way to change the structure of the matview concurrently, which means any `ALTER MATERIALIZED VIEW` will result in all reads to the view blocking, potentially for many minutes or longer, for views over nontrivial data. Adding a column to a table is a cheap and non-blocking migration, but allowing the new column to appear in matviews can require an outage or fancy transactional view-name swapping that libraries like scenic don't support.

- a column that is included in a view of any kind cannot change in any way, even ways that are binary-compatible like VARCHAR->TEXT. This comes up surprisingly often and getting around it is annoying.


How about creating a new materialized view, then renaming (or dropping) the old and renaming the new? That assumes you don't need model changes, or can guard/fence them somehow.
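
Something like this sketch (view names and query are illustrative); since Postgres DDL is transactional, the swap can be atomic:

    begin;
    create materialized view report_new as
      select account_id, sum(amount) as total from payments group by 1;
    alter materialized view report rename to report_old;
    alter materialized view report_new rename to report;
    drop materialized view report_old;
    commit;

Readers of `report` should only block briefly on the renames' locks, rather than for the whole rebuild, since the expensive work happens on the new object before the swap.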


View-backed models are a superpower. If you're a working Rails dev and you've never used them before, you really owe it to yourself to check them out. IMO, they should be baked into Rails itself. They're that useful.


Are there any sensible ways to do this with Django? Ideally with migrations to create and update the views.


> Are there any sensible ways to do this with Django? Ideally with migrations to create and update the views.

There are apps that integrate PG views as Django models; however, it's kind of trivial to do it yourself.

1. Create a migration that runs the SQL to create the view, e.g. `CREATE VIEW active_users AS SELECT ...`

2. Create a model mapping all the fields you want to use, add `Meta.db_table` with the name of the view, and `Meta.managed = False` to avoid adding this to your migrations in the future.

Use it as any other normal Model.

That's it.

As for updating with migrations: if you're already thinking of using views, you can just create migrations that drop and recreate them when needed. That's what I do, and it's very little work.
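
For example, a later migration's SQL might just be (continuing the hypothetical active_users view from above):

    drop view if exists active_users;
    create view active_users as
      select id, email, last_login
      from users
      where last_login > now() - interval '30 days';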


Here is an old article that I used to implement my own: https://www.fusionbox.com/blog/detail/using-materialized-vie...


I am working on this! [1] Albeit slowly because it is not my full time job.

What I have so far is support in the schema editor and very basic migration support. The underlying query is defined as a meta option on the model.

[1] https://github.com/SectorLabs/django-postgres-extra/


If you use a migration tool that works at the database level rather than the model level, then the migration handling of views will be automatically supported.

Last time I tried to model a database view in Django it was painful, but support may have improved since.

But worst case you can drop down to a vanilla query.


I once worked on a larger Rails project where we had decided to utilize Postgres views and materialized views to optimize the performance of parts of our application. This was about 7 years ago, so before Scenic existed to help with this (or at least before it was known). We had to manually manage the use of our views while keeping Rails apprised of the underlying schema through custom schema files and migration methods.

One of the main issues we had run into, in addition to all the complexity that comes with managing a cache and no longer being able to count on your db queries to return the latest information, was that our views had to be deleted and re-created for nearly every schema change (at least every schema change that affected the views, but we had so many views, this ended up being probably 90% of our schema changes).

We ended up overriding the standard Rails "up" and "down" migration methods to prepend the deletion of these views, append their re-creation, clean up after failed migrations, etc. I remember us spending a good amount of time across the team dealing with weird edge cases that cropped up from this. I also remember spending a fair amount of time training all the developers on these issues since almost no one on the team had experience using or dealing with database views. I assume managing migrations and schema of your views within Rails is all included in what Scenic does for you for free, since these hassles weren't mentioned in the article.

It's really nice to see something that handles all that complexity for you. However, that still leaves the standard and age-old caching issues with stale data. Another side effect was that they started being a bandaid for inefficient queries and data structures in the database, which is probably how we ended up with so many db views in the first place.

In the end, after probably 8 months of using these views, we decided to excise them from the application entirely and go back to optimizing our actual database queries. We got to delete a lot of custom code, our schema changes and migrations became simpler, our data became fresh again, and we got to reinvest that time into improving our underlying data structures and queries.

This probably is not an argument against using db views, but rather an anecdote of what can happen when you resort to them prematurely. If you have queries that are 100ms or more, there are probably more traditional optimizations you can find, such as restructuring your relational data, adding/removing indexes, etc. If you're actually trying to eliminate those last 10s of milliseconds from your queries though, as is shown in the article, I can see them being a good option.


Another approach could be to create the materialized view and then use some sort of event system, say the pg_notify subsystem, to capture an event and then increment some aggregation. This would eliminate the need for scheduled view refreshes and keep the view almost real-time and quick.
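
A rough sketch of the notification side (illustrative names); an app-side worker would LISTEN on the channel and apply the increments asynchronously:

    create or replace function notify_clicks_changed() returns trigger as $$
    begin
      -- send the new row's id on the 'clicks_changed' channel
      perform pg_notify('clicks_changed', new.id::text);
      return new;
    end;
    $$ language plpgsql;

    create trigger clicks_notify
      after insert on clicks
      for each row execute function notify_clicks_changed();

    -- a long-running worker subscribes with:
    listen clicks_changed;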


This might be best done using triggers. I've had good experiences with the HairTrigger gem for managing these in Rails.


A trigger would work, but the pg_notify approach could allow for asynchronous updating, whereas the trigger approach could slow writes.


Real-time means latency is predictably within some deadline. But the system he describes is returning wrong answers quickly rather than invalidating stale cache hits and recomputing right answers too slowly.


You could do something similar with built-in ActiveRecord callbacks as well.


"There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors." (Phil Karlton, et al)

Views (both regular and materialized) are really useful, but remember that storing data in two places really is one of the hard problems in CS. Refreshing the materialized view on a timer might seem easy, but 6 months from now, something important reading from the stale view instead of the real table could become a really frustrating bug. A materialized view might still be a worthwhile optimization; just remember that you're also adding cache management, which might be more complicated than you suspect.


Presumably if you're materializing a view, you need a cache for speed; it's either this or Elasticsearch or Redis. For me, where possible, it's nice to have a correct cache a `REFRESH MATERIALIZED VIEW` away instead of needing to make sure your cache bookkeeping is perfect with other solutions.

But yes, you're right that this is caching, with all associated pitfalls.


At the end of the day, whatever benefits are supposed to derive from the technique seem mostly academic.


Is there an upper limit on the number of materialized views that it makes sense to have in a database?

I've been trying to figure out a way to deal with my (growing) join of 13 rather large tables when fetching all of a user's data. Are materialized views the kind of thing I could use to generate these cached results _per user_, and have, like, hundreds of thousands of them sitting around to query? And then be able to query all of user 123456's content directly, instead of filtering N tables for their content and joining them all together every time?


How often does rails refresh the materialized views? And is it automatic or do you have to explicitly refresh them?


You typically do it on a schedule, e.g. once per hour, or once every day.

In the post the whenever gem (https://github.com/javan/whenever) is used, but you could use anything that is able to run things periodically.

If you want to stay solely within Postgres, there is also pg_cron (https://github.com/citusdata/pg_cron), which could be used to call REFRESH MATERIALIZED VIEW from within the database.
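
For example (view name illustrative), after installing the extension you can schedule the refresh with pg_cron's cron.schedule:

    create extension pg_cron;

    -- refresh at the top of every hour; CONCURRENTLY avoids blocking readers,
    -- but requires a unique index on the materialized view
    select cron.schedule('0 * * * *',
      'refresh materialized view concurrently account_totals');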


The post covers this under the section "Refreshing a materialized view".


In addition to what the other commenters mentioned, if you’re importing data, you can refresh the affected materialized views after the data import is complete.



