Using PostgreSQL as a Data Warehouse (narrator.ai)
435 points by cedricd on May 10, 2021 | 150 comments



Dr. Martin Loetzsch did a great video, ETL Patterns with Postgres. He covers some really good topics:

- Instead of updating tables, build their replacements under a different name, then rename them. This makes updating heavy-to-compute tables instant. Works even for schemas: rebuild a schema as schemaname_next, rename the current one to schemaname_old, then rename schemaname_next to schemaname (see the sketch below).

- Keep all the source data raw and disable WAL, you don't need it for ETL.

- Set memory limits high.

And lots of other good tips for doing ETL/DW in postgres. It's here: https://www.youtube.com/watch?v=whwNi21jAm4
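A minimal sketch of the schema-level swap, with made-up schema/table names (not from the talk itself):

    -- Build the replacement schema next to the live one.
    CREATE SCHEMA analytics_next;

    CREATE TABLE analytics_next.daily_revenue AS
    SELECT order_date, sum(amount) AS revenue
    FROM raw.orders
    GROUP BY order_date;

    -- The swap itself is just two catalog renames, so it's near-instant.
    BEGIN;
    ALTER SCHEMA analytics RENAME TO analytics_old;
    ALTER SCHEMA analytics_next RENAME TO analytics;
    COMMIT;

    DROP SCHEMA analytics_old CASCADE;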

I really appreciate having data in postgres. It's often easy to think that a specialised DW tool will solve all your problems, but that often fails to consider things like:

- Developer experience. Postgres runs very easily on a local machine; more specialized solutions often don't, or are tricky to set up.

- Learning another tool costs time. A developer can learn postgres really well in the time it takes them to figure out how to use several more specialised tools. And many devs already know postgres because it's pretty much the default DB nowadays.

- Analytics queries often don't need to run at warp speed. Bigquery might give you the answer in a second but if postgres does it in a minute and it's a weekly report, who cares?

- Postgres is boring and has been around for many years now; it will probably still be here in 10 years, so time spent learning it is time well spent. More niche systems will probably be superseded by fancier, faster replacements.

I would go so far as to say you don't necessarily need to split out your DW from your prod DB in every case. As soon as you start splitting out a DW to a separate server you need some way to keep it in sync, so you'll probably end up duplicating some business logic for a report, maintaining some ingestion app, shuffling data around S3 or whatever. Keeping your analytics in your prod DB (or just a snapshot of yesterday's DB) is often good enough and means you will be more likely to avoid gnarly business rules going out of sync between your app and your DW.


The tooling behind the approach has been built as a set of Python packages named Mara. It is available on GitHub:

https://github.com/mara/mara-pipelines

And additional packages can be found at the Mara org:

https://github.com/mara


And I'd say that if you are in the position where you can run DW workloads on your prod database, you probably don't need a data warehouse in the first place.

Data warehouse workloads tend to be very IO intensive and could be highly disruptive to a production DB. ETL is hard, but it's the price to pay to isolate these two very different workloads.


DW is a capability, not a scale problem.

DW lets you gain insight from your available data, whether there's a lot of data, or not.

DW with postgres is probably the right choice if your data isn't petabytes level.


DW is a scale problem, what you are describing is data analytics.

DW has always described the storage solution not the workload.


From a company perspective, DW is an implementation detail for the utility it provides, which is analysis and reporting. These activities have such different lifecycles, accountabilities, and controls, to operational data stewardship, that the pattern of a BI team operating independently from the corporate IT unit may arise by governance pressures alone. Differences in storage architecture and data schema aren't considered when it comes to carving up the org chart, but they're certainly enabled by the consequences: in my experience of large-to-very-large enterprises it's common for BI teams to take practically vertical control of their entire stack, from the racks to the apps (or in public clouds, one or more separated accounts), and all the infrastructure in between, a classic sighting of Conway's law in action.

More than once I've pointed at such teams as an explainer when pitching a service-team approach to standing up a new application. (Caveat, lest it backfire if you try the same, check who you're talking to first; they're not always fans of "the BI gang")


I hate to admit this, but I’m definitely one of the folks who isn’t a fan of the BI gang that operates in my workplace. They seem to have a very superficial understanding of how anything works. Their stuff constantly breaks on schema changes, and their response is never “hey, this seems like a broken workflow. How do we improve it?”. It’s always “@here our DW may be broken but we will get it back up soon” and radio silence as they try to privately figure out how to fix something they don’t quite understand.

Not to mention that they have hardly any monitoring/alerting, which means that the data in the DW is not reliable, which can be really frustrating for users.

Overall I get the sense that these are unskilled folks who oversold what they could do but ultimately have very little clue what’s really going on.


A more common experience with BI teams than people are willing to confess to, especially when the business the BI team is meant to be tracking and measuring is growing faster than the team can manage. In my experience and observations, this phenomenon where a BI team's processes/products are unstable is usually because of the team's lack of depth in DBA skills. Not necessarily the lack of a DBA, but the lack of any engineer with an eye or attention given towards environmental stability, optimization, and the traditional "janitorial" clean-up work that tends to be neglected in the BI space.


If the BI team’s stuff constantly breaks because production databases change, I would say the software engineers are doing a shit job. It’s the engineers’ responsibility to communicate such breaking changes in advance to teams that rely on their products.


I see your point, but engineers should not have to worry about DB schema changes breaking anything. The contract they provide externally is via APIs. The BI team slurping up DB binlogs must not create an expectation of support from engineers.


That's not true as soon as anyone else is relying on your schema (and data warehousing certainly does). At that point, you're maintaining an API contract whether you like it or not. Fortunately, databases also provide mechanisms for abstraction (like views) that can make this much easier on both the developers and the DBAs, which is an approach I've personally had good success with in companies with similar requirements. The key is to stop thinking of your database as an application-specific store and start thinking the same way you do for normal development--which tables and schemas are internal and subject to change, which ones are exposed to the outside world, what your stability guarantees are, etc. etc.

Indeed, you can even guarantee backwards compatibility by versioning views you expose as APIs to other applications, and having each depending application create a view in its own schema that references the view--in Postgres, this means that only backwards-compatible migrations to the original view will go through! So when you find you can't make a change you need backwards-compatibly, you know it's time to make a new version, and you can keep the two versions of the API in sync by recomputing data in the new view to the old one until the depending applications can upgrade :)
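A rough sketch of that versioned-view idea, with invented schema and table names:

    -- Owning application: expose a stable, versioned view over its internal tables.
    CREATE SCHEMA api_v1;
    CREATE VIEW api_v1.orders AS
    SELECT id, customer_id, total_amount, created_at
    FROM app_internal.orders;

    -- Depending application (e.g. the DW loader): reference it via a view in its own schema.
    CREATE SCHEMA warehouse;
    CREATE VIEW warehouse.orders AS
    SELECT * FROM api_v1.orders;

    -- Postgres tracks the dependency, so a migration that would break
    -- api_v1.orders out from under warehouse.orders (e.g. dropping the view
    -- or a column it exposes) is rejected unless you explicitly CASCADE.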


> That's not true as soon as anyone else is relying on your schema

Well, don’t do that then. I think the parent comment suggested that it was an anti-pattern.

Martin Fowler seems to agree: https://martinfowler.com/bliki/IntegrationDatabase.html


With all due respect, Martin Fowler's opinion on this matters a lot less than those of the people you're actually working with. There are many advantages to exposing your data this way, particularly performance and transactional integrity advantages, that are extremely difficult (and sometimes impossible) to recover otherwise; and like I said, there is no inherent reason why presenting an abstract contract as a view needs to make evolving your codebase any harder than presenting one as a REST API (or whatever your other preferred choice is). It "ties you to the database you're using" only in the sense that anything you don't expose through the database won't enjoy the same performance optimizations and transactional consistency, which would be the case anyway if you used a second database.

The scenario presented on the website (a separate organization controlling the database schema and application developers having to negotiate with them to get what they want in) is not what I'm proposing here anyway, and seems to be deliberately conflating the use of the database as an API and some sort of organizational pattern that is mostly orthogonal to that. There is no reason applications can't have their own private schemas that are managed by the application developers, who then choose what data to expose as public (and what as private), while still sharing a database with other developers. And for applications where database operations are particularly performance sensitive or safety critical, there's also no reason why not exposing your schema to other applications implies that application developers are allowed to have full control over it. I've been in organizations at various points in time that do both of these things.

Of course, there may be other restrictions that make this a bad idea (such as security restrictions that for whatever reason can't be enforced within the database), but for consumption for BI processes this is not really applicable anyway.


So if the engineering team uses an ORM, pushes a massive API change that renames columns, removes others, and drops FKs, and leaves it as a change item in a release list in an access-controlled repo without notifying anyone, it's the BI team's fault the DW ETL breaks?

IMO it's not fruitful to throw blame on teams or call them unskilled. Usually, everyone is trying their best to do their job, and things that make this not happen should be seen as "how can we improve/fix this" - better communication is almost always the solution.


It’s the engineering team that creates the actual value for the company. Of course they should have an easy path to clear technical debt.

Database level integration is tempting at first since it’s zero effort up front, but it usually ends badly. It’s well worth it to make the BI team go through an API, even spend the effort on a well-designed one to give them a nice experience doing so.


> It’s well worth it to make the BI team go through an API, even spend the effort on a well-designed one to give them a nice experience doing so.

In an ideal world yes. But sometimes, the engineering team wants nothing to do with the data itself, and so the BI team is often left to do what they can with what they have in order to do their job at all.

Being a data engineer, both sides of the coin often fall on me, and so I understand the difficulties on both sides. Almost always (anecdotally, so take with a grain of salt), the problem lies with upper management because they aren't willing to invest the time to actually put proper infrastructure in place; the focus is on building features to satisfy clients, without much thought on how this can affect the overall internal system, and so both sides have to rush to do things without thinking how it affects the other side. However, it likely won't change because at the end of the day, the client is happy - whether or not this is an "unfortunate reality" of business/tech is a discussion for another day.


Until I read the second paragraph, I was going to say to your first "then the company clearly doesn't value BI enough to warrant the existence of a BI team".

It's probably more a symptom of upper management valuing neither the core team nor the BI team, thinking software writes itself, and business insights come for free with the MBA. We're really in the same boat here. All the more reason for core team to make nice hooks for the BI team. They won't be appreciated by management, they probably won't be appreciated by the customers, but at least they can be appreciated by each other.


In my experience (which doesn't mean I'm disagreeing with your experience), it's not that upper mgmt doesn't recognize the value of BI - they just don't want to invest in it. They want their cake without having to pay for it. It can be hard to find a place with mgmt that really understands that investing in developer velocity and good infrastructure pays way more dividends in the long term than constant short term "promise X, deliver X*0.8 to make customer think they got what they want, and repeat to get the contract".

But yeah, we could complain on and on about all the problems of shortsighted management... it's a tale as old as time.


I'm exactly at this point now, where we have BI which breaks from time to time - sometimes because of infra, sometimes because of source data model changes. I'm now thinking how to prevent the second one; my plan is to create DB views for the data that BI needs, so that the source system would automatically be notified when trying to change the schema. In a sense the views will become an API for the data. Yes, and monitoring is a must-have for everything.


This is the problem with business people being let loose to sell ideas without the technical depth.

A data warehouse is not a capability, it's a data system. Company A with a warehouse and company B without do not necessarily differ in any capabilities.


You can run the workloads on (possibly partial) replicas of the production database running on another set of machines.


Re: renaming tables - yay! It's essentially double-buffering for databases.

I've used a slight variant of this in the past: I'll have a table (e.g. my_schema_new.my_table) that gets updated by an ETL pipeline. I'll then also have a matview (e.g. my_schema.my_table) that's just SELECT * FROM my_schema_new.my_table. As long as I can give this matview a UNIQUE index, I can then REFRESH MATERIALIZED VIEW CONCURRENTLY my_schema.my_table for zero-downtime updates.

(Of course, if you're using this in a non-interactive data warehouse context, you might not care so much about zero-downtime updates; this is more for application-facing views. The REFRESH ... CONCURRENTLY pattern can be faster for incremental updates, but often struggles with larger changes to the underlying table, as it's essentially applying a diff between the two versions. Also, it only works in cases where your users can tolerate data that's as stale as the scheduled time between REFRESHes.)
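Spelled out, the pattern might look roughly like this (column names are invented; both schemas assumed to exist already):

    -- The ETL pipeline rebuilds this table however it likes.
    CREATE TABLE my_schema_new.my_table (
        id         bigint PRIMARY KEY,
        metric     numeric,
        updated_at timestamptz
    );

    -- Application-facing materialized view shadowing the ETL table.
    CREATE MATERIALIZED VIEW my_schema.my_table AS
    SELECT * FROM my_schema_new.my_table;

    -- CONCURRENTLY needs at least one unique index on the matview.
    CREATE UNIQUE INDEX ON my_schema.my_table (id);

    -- After each ETL run; readers keep seeing the old contents while this runs.
    REFRESH MATERIALIZED VIEW CONCURRENTLY my_schema.my_table;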


> Instead of updating tables build their replacements under a different name then rename them.

This works really well, although a few minor warnings.

If you run backups, they will block DDL changes like creation and rename of a table. Now this might block all reads on your table until backup completes. Make sure you do this on a schema that isn’t being backed up or do other workarounds.

Other than that, this trick is really nice for instant replace and can also reduce disk space churn / fragmentation.


This works well until it doesn't and you're incrementally adding time to the loads. Some DBs handle it better than others. What most are looking for is performant, less constrained materialized views. If you're running ELT then MVs are a big deal. Additionally some DBs (SQL Server) don't have an option for pure "no logging" so, even with simple, there's overhead to the large writes.

I like schema swaps too for the tables rather than renames but the pattern is the same.


In the above-mentioned scenario there is no need for backups, as the database (or more precisely the schema) is recreated at each run.


That first point is exactly what dbt (https://www.getdbt.com/) does. It's an amazing tool, with extremely well thought out developer ergonomics.


Snowflake + dbt is great for this. Snowflake has a swap database function, so instead of doing this at a table level you can create an entirely new database and do a blue/green style deployment. Not sure how this works with pgsql, but I suppose it's more complex.


You’ve described the ELT approach, which is why dbt is taking the valley by storm.


And if you use a replica for analytics the overhead on performance is tiny!


This should be the first step every company does as it scales out. If you can no longer run analytics against your production database for whatever reason, fire up a secondary replica with a 12/24 hour delay and you're golden.


> - Instead of updating tables build their replacements under a different name then rename them. This makes updating heavy-to-compute table instant

isn’t this what materialized views are for?


I agree that a materialized view is a lot more ergonomic and probably the correct choice in most cases.

Refreshing a materialized view blocks selects from it though, rendering it inaccessible while the refresh is taking place. There is the CONCURRENTLY option but that may be more costly and requires a unique index on the view.

There's also the issue of rebuilding several dependent views. Say you have 3 dependent views that are all being shown in some dashboard: 1, 2 and 3. You refresh view 1 OK but view 2 fails for some reason. Now view 1 and view 3 are out of sync and view 2 is broken.

Encapsulating the 3 dependent views in a schema and only doing the name swap once all steps are OK prevents you from ending up in this state. If view 2 breaks you won't do the name swap, so you'll be serving stale data, which is probably better than no data for 2 and out-of-sync data for 1 and 3, until you can fix it.


Thank you for the elaborate response. So I think the question is how near time your view data needs to be. Especially for reporting purposes (which is probably the most common use case of data lakes) it’s often acceptable to have data ready per T-1 (in terms of business days). For financial reports, last day’s close prices of assets is usually what’s being used. So, being able to relax requirements in this manner can reduce a lot of complexity.


Absolutely. Complexity in our own system pretty much exploded when we wanted to serve stats at near "real-time" (a couple of seconds delay). If business could just chill out and wait until next day, or even like an hour, we could have saved countless engineering hours. Such is life.


Why would developers need to learn another tool? They are all just SQL at the end.


A new Postgres-based darling is TimescaleDB. It’s a drop-in for Postgres.

It is a hybrid row column store with excellent compression and performance.

It would be interesting to see how it compares if narrator would try it out. Benchmarks would be cool.

One very neat feature I am enamored by is “continuous aggregates”. These are materialized views that auto-update as you change the fact table.

Continuous aggregates are a great idea. InfluxDB had “continuous queries” (but the implementation of influx generally is not so neat), and firebolt has “aggregate indexes” which are much the same thing.

I think all olap dbs will eventually have them as staple, and that they will trickle down into oltp too.


AFAIK it's an extension to pg and not a “drop-in” replacement


Would TimescaleDB be much faster for analytical queries that aren't necessarily segmented or filtered by time?

My uninformed assumption is if I do a group by over all rows in a table that they may not perform better.

I'll look into their continuous aggregates -- that could be one way to get around the cost of aggregating everything if it's done incrementally.


The reason why TimescaleDB is so fast relies on a really simple concept: keep indexes small enough to keep in memory. If you have a small enough index, you only need to do 2 reads from disk: one for the index to locate the data, and then one for the actual data.

Fetching data becomes much, much slower once your index is too large to fit in main memory. TimescaleDB segments the indexes into chunks (the “hypertables”) and makes sure these chunks are all “small enough”.

This is alleviated further by having the data sequential by time; inserting new data does not need to alter older index chunks, which is what makes inserts fast.

I can imagine that if that's not the case and your inserts alter “older” chunks, so that data needs to move between lots of chunks, the database could become prohibitively slow.
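For context, the chunking is set up with a single call when you create the table (names invented; assumes the timescaledb extension is installed and preloaded):

    CREATE EXTENSION IF NOT EXISTS timescaledb;

    CREATE TABLE metrics (
        time   timestamptz NOT NULL,
        device int,
        value  double precision
    );

    -- Turns the table into a hypertable, partitioned into time-based chunks
    -- so each chunk's indexes stay small enough to fit in memory.
    SELECT create_hypertable('metrics', 'time');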


I would have to dig more into specifics of your use case but my gut reaction is yes, it would be better.

I do not have specific experience with TimescaleDB, but I have some experience scaling PostgreSQL directly and with Citus (which is similar, but not the same). But depending on the nuances of your use case, I can envision a number of scaling strategies in vanilla Postgres to handle your use case. A lot of what Timescale and Citus does is abstract some of those strategies and extend them. Which is just a vague way of me saying: I think I could probably come up with a scheme in vanilla Postgres to support your use case, and since Timescale/Citus makes those strategies even easier/better I am fairly confident they would also handle that use case.

As an example I currently have a table in my current Citus schema that is sharded by column hash (e.g. "type" enumerator) and further partitioned by time. The first part (hash based sharding) seems possibly sufficient for your use case.

Beyond the most simple applications in that domain though, there are more exotic options available to both Timescale and Citus that could come into play. For example, I know Citus recently incorporated some of their work on the cstore_fdw into Citus "natively" to allow columnar storage tables directly:

https://www.citusdata.com/blog/2021/03/05/citus-10-release-o...


I’d guess any roughly sequentially keyed table ought get good insert performance. Think how common auto-increment is.

And being HTAP, timescale ought do better than a classic pure column store on upserts and non-appending inserts too.

Of course if your table is big and the keys are unordered you still might get excellent performance if your access pattern is read heavy.

Of course you can still mix in classic Postgres row-based tables etc. Timescale just gives you a column store choice for each table.


> if I do a group by over all rows in a table that they may not perform better

Computing aggregates against contiguous values in memory or disk for a single column will always be faster than reading records with differing value offsets/alignments. These operations can benefit from SIMD and other hardware optimizations you don't get with row-aligned data.


Much faster, if you can leverage its strengths. It helps if you have a time column or a monotonically increasing value like a serial column. But just using the table compression and columnar format can give an order of magnitude speedup on analytics queries (because you load and scan fewer bytes).


This recommendation reads backward. You made a suggestion of a substitute that's "comparable" yet without any data, and then go on to nudge the author to try it. How do we know you didn't just pull up something to have the author test it for you?


> Avoid Common Table Expressions (or upgrade to 12)

Upgrading to 12 (or 13) seems like the better option here, whenever you're able to do so. The improvements are very much worth it.


Something painful I just learned today is that even Postgres >= 12 isn't that great with planning CTEs. Querying the same CTE twice seems to force it into materializing the CTE just like Postgres <12 used to do.

Fortunately there's a workaround - using the `with .. as not materialized ()` hint sped up my query 100x.
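For anyone who hasn't seen it, the hint goes right after AS (table/column names here are made up):

    -- PG 12+ may still materialize a CTE that's referenced more than once;
    -- NOT MATERIALIZED forces it to be inlined into the outer query instead.
    WITH recent_orders AS NOT MATERIALIZED (
        SELECT customer_id, amount
        FROM orders
        WHERE created_at > now() - interval '30 days'
    )
    SELECT
        (SELECT count(*)    FROM recent_orders) AS order_count,
        (SELECT sum(amount) FROM recent_orders) AS total_amount;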


Yeah, the only thing PostgreSQL 12 did was to remove the optimization fence. There was no additional optimizer logic added.


What situations would make people unable to upgrade to 12+?


One case I decided against it was when I needed to access the new postgres server (which I initially planned to be v13) from an old machine (with a legacy distro with v9.6). v13 introduces a new SCRAM-SHA-256 password hashing method by default and only libpq10 and newer supports this method. For some reason I couldn't or didn't want to either rehash the passwords or upgrade the client, so I remained on a lower version. Certainly not unsolvable, but I didn't have the time to spend on a proper fix.


> What situations would make people unable to upgrade to 12+?

In my personal experience: the organisation doesn't invest in a storage solution that offers snapshots/COW (e.g. ZFS or SAN or whatever). Then they wait to upgrade until their disks reach a usage that the upgrade has to be done in-place. Then they become like rabbits in the headlights and never upgrade.


Maybe it doesn't get prioritized until it's important. I know PG upgrades are pretty straightforward, but sometimes people don't want to touch something running well.

That said, given the performance implications, if someone wants to use PG as a warehouse upgrading to 12 is a no-brainer.


Yes, that's a really great point. I should emphasize that more clearly in the blog :).


A few more items I've encountered:

- Citus data, not Citrus data.

- In tables, column order matters. Order const-width non-null columns before all other columns (best so that there's no unnecessary alignment padding). Then maybe some low null fraction fixed-width columns, then ordered by query access frequency. A lot of time can be spent extracting tuple values, and you can save a lot of time using the offset caches by correctly tetris-ing your columns. Note that dropped columns are set to null, so a table rewrite (by redefining your table) may be in order for 100% performance.
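A contrived illustration of the column-tetris point (assuming the usual 8-byte alignment of bigint/timestamptz; exact padding is platform-dependent):

    -- Wasteful: each 1-byte boolean sitting before an 8-byte column
    -- costs up to 7 bytes of alignment padding per row.
    CREATE TABLE events_padded (
        is_internal boolean,
        user_id     bigint,
        is_test     boolean,
        occurred_at timestamptz,
        note        text
    );

    -- Tighter: 8-byte fixed-width columns first, small fixed-width next,
    -- variable-width (text, etc.) last.
    CREATE TABLE events_packed (
        user_id     bigint,
        occurred_at timestamptz,
        is_internal boolean,
        is_test     boolean,
        note        text
    );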


Isn't #2 something that the database engine should handle by itself? Const-width, non-null is written in the DDL and the engine should be able to handle it.


Yes, ish. But PostgreSQL has no reordering of table columns built in, mostly due to the limitations of the internal representation of its schema (also called features) and in how the DDL is performed.

Adding or removing a column in this system does not require postgres to rewrite the whole table, which means that old data can stay in the table effectively forever, as long as there are no other rewrite-required DDLs performed. Additionally, columns can be updated to SET NOT NULL / DROP NOT NULL, further complicating the whole system you're trying to optimize.

Eventually postgresql might support some form of table column reordering that permanently optimizes out deleted columns, but I think it's unlikely to happen anytime soon. There are lower hanging fruits on the tree; altering existing table definitions in a backwards-incompatible manner is a lot of effort for likely very little gain.

As for column packing: Maybe this can be implemented for CREATE TABLE with an option, but in the current transactional DDL framework this cannot be implemented for ADD COLUMN, because we can't reorder columns.


Yes, it is strange that "Avoid CTEs" made it into the TL;DR instead.


I updated the blog post :)


No mention of Greenplum? Literally a columnar DW built on top of pg.

Where are my DuckDB people? (Think SQLite for OLAP workloads.)

https://duckdb.org/


I love postgres and my business relies on it. However, at the scale the author is talking about (~100m rows), throwing all the data into BigQuery is very likely to be a better option for many people.

If rows are around 1kB, then full-dataset queries over 100m rows will cost less than $0.5 each on BQ -- less if only a few columns have to be scanned. Storage costs will be negligible and, unlike pg, setup time will be pretty much nil.


Yep. Fully agree. The point of the post wasn't to say that you should use PG as a data warehouse. Just that if it's what you've got available (for various reasons) that you can.


I've had well over a billion rows in multiple tables with PostgreSQL and it wasn't a problem at all.


$0.5 queries?


Per the BigQuery docs, GCP charges $5 per TB of data read. Which works out to be $0.50 per query if said query reads 100 GB of data.

It feels expensive, but then again running and maintaining a big data platform is inherently very expensive. If you consider the fully loaded cost of a single big data engineer to maintain such a platform to be ~$250,000 (likely an underestimate if you want similar performance characteristics to BigQuery [disclaimer: never used it myself, but I assume its performance is near-unbeatable]), that'd be ~500,000 100 GB queries. Which makes a $0.50 query feel reasonable, relatively speaking. GCP also sells dedicated "slots" (as they call it) which as I understand it is an abstraction over a CPU. If you buy said slots, the marginal cost of queries is $0, but you may be subject to queueing. No idea what a "slot" actually represents however.

https://cloud.google.com/bigquery/pricing#on_demand_pricing


Full scan. Still absurd what we've come to, to think this is tolerable.


A full scan rarely happens on BQ because of the nature of the columnar store. Try some public dataset like the HN archive, and see how much a query actually costs. You'll need a very advanced (or stupid) query to read 100GB at once on BQ.


Reading 100GB from disk costs $0.50? This is absolutely incredible to me; how has this become acceptable to the industry?


It's not for every use case, but Big Query is often a very stupidly cheap datastore. Query results get cached, and repeats don't incur a charge unless the data has changed.

It's not a datastore to power a crud app, or anything requiring frequent queries, but it's a great place to stash gobs of logs that you may need to query at some point. Or it's great for serverless batch workloads and is often cheaper in both time and money than firing up spark clusters or something similar to do the work.

Quite frankly, it's awesome. But sure, they do use it as a tool for lock-in, and for some cases it would be prohibitively expensive.


Incredible as in 'that's a great deal' or as in 'that seems a ripoff'?

I find 50c to read 100GB from disk, do useful work on it (including running javascript code or ML models if you are so inclined) and returning a result in seconds... pretty damn incredible.


We find value with this model because we don't pay for the instance when it's idle and queries come back extremely quickly.


A query reading about 100GB with one of the most advanced data warehouse systems with no operational overhead and integration into a major cloud environment costs $0.50.

There's a lot more to value than the price.


Exactly.


The point of BQ is to allow you to perform queries which are ad-hoc and/or touch a significant fraction of the data. If you have a problem of that shape, then full column scans are not merely tolerable, they are optimal.


You don't pay for having a server running. You only pay for storage and data read during queries.

No server fees and no fees when autoscaling up for heavy computations.


Is anybody aware of any serious research Yugabyte vs Postgres?

Yugabyte appears to an application as essentially Postgres 11.2 with all psql features (even the row / column level security mechanisms) but, apparently, handles replication and sharding automagically (is it a DHT, similar to Cassandra?).

Are there any serious technical comparisons with some conclusions, without marketing bs?


Yes, Yugabyte has been featured on Jepsen. It's basically in the class of "newsql" natively distributed relational databases.

Yugabyte uses actual Postgres code for the query-parsing top layer and then translates that into operations on its key/value store, which handles replication and distribution. CockroachDB is similar but has built everything from scratch in Go. There are similar examples like TiDB and Vitess for MySQL as well.


The Jepsen tests do not inspire confidence but they're also pretty aged by now. 1.3.1 under test and the most recent appears to be 2.7.0.

Cockroach is a no go for me because of the licensing restrictions. For example, password authentication only in the BSL licensed code. Backups in enterprise.



Yugabyte is multi-raft, I believe.


> Because of this dedicated data warehouses…use column-oriented storage and don't have indexes.

Well, that’s not really correct, is it? ClickHouse for one definitely has them, as did Snowflake the last time I used it.

This is a lot of work to go through to avoid using the right tool for the job. Just use something like ClickHouse or even DuckDB and reap the benefits of better performance with fewer caveats.


I had the opposite reaction: running analytics against the database you are already using feels like a lot less work to me than adopting an additional tool and solving the problem of synchronizing your data to it.


If the database stays small and the load stays low, sure.

But, as things grow, you tend to run into problems with analytical loads adversely impacting, and even knocking over, the production system by locking resources and causing timeouts. Especially if you're allowing analysts to run ad-hoc analytical queries.

Long story short, yes, resiliency is expensive, but it's not always more expensive than not having resiliency.


I'd order the solutions, under "you should choose the simplest one that doesn't disrupt your environment", as:

Use a single database for transactional and analytical workloads.

Replicate your transactional database as-is for analytical workloads.

Remodel your data and replicate to the exact same technology stack.

Remodel your data and replicate into specialized tools for analytics.

I have never seen anybody that actually needs the last one. But the largest environments I've looked at are government databases with a few thousand people working on them (there are bigger envs out there).


Read replicas can take the brunt and have widespread availability in Postgres cloud providers.


The blog post itself makes a point of running this on a dedicated instance, so you’re already shipping data somewhere, and it’s not that unreasonable to assume that pretty quickly you’re going to need/want to start performing transforms on the data to get it into a more appropriate schema for analytic queries.

Synchronising the data can be done in stages as well. Daily loads are pretty straightforward and usually small enough to get going quickly and maintain. As your needs increase you up the frequency or start employing more sophisticated methods.


It's correct.

Snowflake does not have indexes, and ClickHouse indexes are what they call "data skipping indexes". BigQuery, Redshift, Netezza, and Vertica also do not have support for indexes.


ClickHouse has both variants: primary and secondary indexes.

Primary key is range index - to quickly locate records. Secondary indexes are data skipping indexes - to quickly skip blocks of data.


> ClickHouse indexes are what they call "data skipping indexes".

That’s still an index though isn’t it? Might work slightly differently, but the purpose is still the same.


Not in the same sense we consider them in traditional databases, used to find specific rows (needle in haystack). ClickHouse indexes are used to eliminate data pages to reduce IO for range queries, not to find specific rows by value.


Row-store indexes (indices?) are used for both. The first few levels of a b-tree index can be used for discarding/narrowing ranges, the leaf levels are for pinpointing a specific row.

What I think you meant to say was that Snowflake does not offer user-manageable indexes, not that they don't use indexes at all.


Snowflake tries to auto-sort its containers to fit the query pattern; most of these platforms use at least one data ordering. Vertica allows multiple projections on a table with different orderings and distribution keys.

On top of that BRIN (block range) indexes are usually used to capture the value of sorting by pruning I/O. I don't see these mentioned here -- they seem like a good open-source version of this idea.
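In Postgres that's a one-liner (table/column names invented):

    -- BRIN stores min/max summaries per block range, so the index stays tiny and
    -- works well when the column correlates with physical row order
    -- (e.g. append-only time-ordered data).
    CREATE INDEX events_occurred_at_brin
        ON events USING brin (occurred_at);

    -- A range predicate can then skip most of the table's blocks.
    SELECT count(*)
    FROM events
    WHERE occurred_at >= '2021-05-01' AND occurred_at < '2021-05-08';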


ClickHouse has support for block range indexes as well. If you want to check open-source implementations - this is probably the best place to do it.


In addition to the other examples cited here, Druid has bitmap indexes on dimension columns. So, it's hard to make a hard-and-fast generalization. All databases use indexes to a greater or lesser extent.

To my mind the big differences for data warehouses are the following:

(a) Table scans are relatively cheap thanks to columnar structure and compression. It's much more important to tune compression than indexes. If you can reduce stored data size by 10^3, you don't need an index. That's exactly the opposite of row stores like MySQL.

(b) Data warehouses don't use indexes to maintain referential integrity, because it's not something they really care about in the first place.

My DW experience is with ClickHouse, but I think it illustrates a lot of the principles.


From a quick look I didn't notice any mention of columnar storage. I would be very skeptical about any claim that a DB without columnar storage "can work extremely well as a data warehouse".


Take a look at Kimball/Star-Schema. It's worked extremely well as a data warehouse technique for decades. That said, I think modern offerings (e.g. Clickhouse) are superior in most use cases, but it's definitely not impossible on a traditional row-oriented RDBMS.


Bear in mind that ClickHouse will quickly fall short when squeezed into a traditional star schema model (it's very inefficient on multiple joins, probably even can't handle more than 1 at the same time). You would really need a database engine that is columnar first but still operates on SQL without too many hidden pitfalls, and that is quite often challenging.


ClickHouse can handle multiple joins just fine and has for a while. I just gave a conference talk on CH this morning that discussed this exact topic, among others.

The fact is that for large datasets scans on denormalized fact tables parallelize well, which means you can (a) offer stable performance and (b) scale more efficiently. This is important for use cases like web analytics, where users play around with different dimensions and measures but still expect consistent response. Note also, dimensions for things like Year, Month, Week, and the like compress absurdly well. It is often way faster to scan these values than to join them.


Yes in Clickhouse you’d generally take a denormalized approach.


Slightly peculiar juxtaposition in subheading, "Differences Between Data Warehouses and Relational Databases". Most data warehouses are relational databases (like, e.g., PostgreSQL). I think you might want to use something like "Differences Between Data Warehouses and Operational Databases" instead?

Also, under "Reasons not to use indexes", #3 says: "Indexes add additional cost on every insert / update". Yes, but then data warehouses usually aren't continually updated during the workday. It's a one-time hit sometime during the night, during your ETL run. (Not a Pg DBA, but presumably you can shut of index updating during data load, and then run it separately afterwards for higher performance?)


Usually you drop indexes before load, then recreate after.
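Something along these lines, with invented table/index names:

    -- Before the nightly load: drop indexes that are expensive to maintain row-by-row.
    DROP INDEX IF EXISTS fact_sales_customer_idx;

    -- Bulk load (COPY is much faster than row-by-row INSERTs).
    COPY fact_sales FROM '/staging/fact_sales.csv' WITH (FORMAT csv, HEADER);

    -- After the load: rebuild the index once, then refresh planner statistics.
    CREATE INDEX fact_sales_customer_idx ON fact_sales (customer_id);
    ANALYZE fact_sales;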


We support multiple data warehouses on our platform. We recently had to do a bit of work to get Postgres running, so we wrote a high-level post about things to consider when running analytical workloads on PG instead of normal production workloads.


Thank you!


What would be great is some way to codify all this advice, e.g. run PG in a Data Warehouse mode. Databases are too big and too configurable, and without specialist DBAs any more, most people are just guessing what works best.


That's a great point.

This isn't really what you're saying, but citus [1] ships a distributed Postgres. A lot of the things they improve would help massively with analytical workloads actually.

1: https://www.citusdata.com/


You misspelled cit<r>us, the same is in the blogpost.


Also, what's your take? Do people use citus for analytical workloads as well as production at scale?

I'd assume yes, but I haven't personally used you guys. I'm just aware of you and broadly how you scale Postgres.


Sorry, I am not familiar with citus.


Ahh! So sorry. Fixed it.


Maybe I missed it but there's no mention of denormalizing strategies or data cubes, fact tables and dimension tables. Structuring your data closer to how it's going to be analyzed is vital for performance. It also gives you the opportunity to cleanse, standardize, and conform your data.

I ran a pg data warehouse in the 8.x and 9.x days with about 20TB of data and it performed great.


I think you're right but it's a bit out of scope. Hard to give generalizable advice around this I think.

What we personally do in practice is put everything into a single time-series table with 11 columns. [1]

1: https://www.activityschema.com/


Postgres has features that help these sort of OLAP type workflows. Things like:

* partitioning your fact/aggregate tables (which was mentioned)

* rolling up old data reducing granularity as data ages can help with record count and overall db size

* PG triggers and stored procedures can be used to manage slowly changing dimensions

* the hstore and json column types are super useful for implementing quasi-NoSQL storage alongside traditional relational storage

* window functions and CTEs (in 12) are great for writing analytical style queries

* Implementing incremental loads with a staging/buffer table and possibly with plpgsql can really make connecting it all easier

Not saying I didn't enjoy the article. It always makes me happy to see people realizing how suited PG can be to analytical workflows especially for small to medium workloads which represents most of what people want to do.
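As a small sketch of the partitioning point above (declarative range partitioning; names invented):

    CREATE TABLE fact_page_views (
        viewed_at timestamptz NOT NULL,
        user_id   bigint,
        page      text
    ) PARTITION BY RANGE (viewed_at);

    CREATE TABLE fact_page_views_2021_04 PARTITION OF fact_page_views
        FOR VALUES FROM ('2021-04-01') TO ('2021-05-01');

    CREATE TABLE fact_page_views_2021_05 PARTITION OF fact_page_views
        FOR VALUES FROM ('2021-05-01') TO ('2021-06-01');

    -- Time-filtered queries only touch the relevant partitions, and old months
    -- can be detached/dropped or rolled up cheaply.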


One would assume a data warehouse professional would know these things. This blog post was specifically about pg features/tips for building a dw.


How did you store that much data in PG? Did you have some sort of sharding mechanism?


Postgres can handle a lot more data than that without sharding and still perform well, especially for predictable table-scan-and-filter kinds of queries (which are pretty much what you want to run if you're doing the kind of denormalization the post you replied to mentions). Of course a column store would perform even better in many of those cases due to better compression / vectorization opportunities, but Postgres is really no slouch compared to other fully featured databases (research databases can blow it away though).


So, 20TB on a single node? Am I understanding this correctly? I guess it would be mostly for ETL stuff, as you would be swapping indexes in/out of memory too much without a lot of care.


Yes you understood correctly, 20 TB on a single node. And this was in 2007. The data was email and direct mail campaign and response data.

Typically only the more recent data was analyzed.

We retained all source data so that reprocessing could happen when adding facts or dimensions.

Analysis, scheduled or ad hoc, only ever happened against the data cubes. Their partial aggregate nature along with partitioning was part of how the performance was so good.


Depends on the kind of index (BRIN are quite effective for large tables), how many rows we're talking about and how the column data are distributed, and how much of the data is actually hot, but yes.


> vacuum analyze after bulk insertion

Ugh, this is a nightmare. I wish they would come up with a better system than forcing this on users.


Luckily Postgres' autovacuum works really well in normal workloads. If there's an even mix of inserts spread throughout time then it's probably best to just rely on it.

For data warehouses inserts can happen in bulk on a regular cadence. In that case it can help to vacuum right after. I'm not sure if it has a huge impact in practice.
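For what it's worth, it's a single statement at the end of the load (table name made up):

    -- Refresh planner statistics and clean up dead tuples after the bulk load.
    VACUUM (ANALYZE) fact_sales;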


Nightmare is a bit of an overreaction. It's just a single command at the end signifying "Ok, done for now". It's not forced on users; the system works just fine if you don't do this. It just works better if you do let the system know when is a good time to perform maintenance tasks and collect statistics.


Postgresql is great for OLTP workloads out of the box. I don't think that it's easy (or even possible) to be a great database for both OLTP and OLAP workloads without any tweaking or input from the user.


It's interesting to see how PG can be configured, but why not just use a real warehouse like Snowflake?

Also, do you have any numbers on how PG performs once it's configured?


I'm glad I'm not in the situation of needing to make the judgement call, but Postgres' ecosystem might be part of the answer. For example, Snowflake has basic geospatial support, but PG has insane and useful stuff in PostGIS like ST_ClusterDBSCAN: https://postgis.net/docs/ST_ClusterDBSCAN.html

Foreign data wrappers are another thing that might be compelling — dunno if Snowflake has an equivalent.

I don't have any numbers but PG has served me fine for basic pseudo-warehousing. Relative to real solutions, it's pretty bad at truly columnar workloads: scans across a small number of columns in wide tables. The "Use Fewer Columns" advice FTA is solid. This hasn't been a deal breaker though. Analytical query time has been fine if not great up to low tens of GB table size, beyond that it gets rough.


Yeah, I think using Snowflake or BigQuery or something is ultimately the better move. But sometimes folks use what they know (what they're comfortable managing, tuning, deploying, whatever).

In my own testing PG performed very similarly to a 'real' warehouse. It's hard to measure because I didn't have the same datasets across several warehouses. Maybe in the future I'll try running something against a few to see.


I really wanted to migrate an analytics project I was working on from Elasticsearch to Postgres: however, when we sat down and ran production-scale proofs of concepts for the change, ClickHouse handily outclassed all the Postgres-based solutions we tried. (A Real DBA might have been able to solve this for us: I did some tuning, but I’m not an expert). ClickHouse, however, worked near-optimally out of the box.


Snowflake and big query are cloud solutions. Some companies have a need for self hosted databases etc.

And often having a homogenous database stack is a plus. If your production systems are all MySQL, then trying to get away with using MySQL for analytics too is a smart move etc.

I’ve seen so many tiddly data warehouses. Most companies don’t need web scale, and they overbuild and over complicate when they could be running on a simpler homogenous stack etc.


> why not just use a real warehouse like Snowflake?

Cost is the most common reason I’ve seen. An RDS instance is about 1/2 the cost of Redshift per CPU and then Snowflake is slightly more expensive than Redshift (often worth the extra $$).

Also, if you’re dealing with less than 10GB of data the difference in performance will be barely noticeable, so at modest scale the most cost effective solution is RDS.


> why not just use a real warehouse like Snowflake?

Because you don't want to learn and maintain a new kind of software in a small organisation with limited resources, for example. Each new language, framework, database and other kind of tool adds a lot of cognitive load for everyone involved.


If your project has less than a few million rows, chances are PostgreSQL will run any analytical query you like in just a few seconds. So you can keep things simpler by avoiding adding any extra tools.


> why not just use a real warehouse like Snowflake

Customers who demand / are legally obliged to ensure their data does not leave their territory, is one big reason.

Snowflake gives you some options, but if you get a big contract with a customer in another region your entire analytics platform is unavailable to that customer.


I almost don't want to read this because for any company I work for it's almost pointless to try to convince them not to use Redshift. I'm slowly seeing the future of software engineering as just us being switchboard operators for AWS.


Beware that Google Data Studio's PostgreSQL connector is severely limited as it is not meant for this type of work. I suspect that other BI tools may have the same problem. For us, BigQuery made a lot more sense.


> ensure you're not I/O-bound

The biggest thing holding me back from using pg as a data warehouse is RDS not having support for instances with ephemeral drives / cost for PIOPS.

I need 100s of thousands of iops not thousands.


Combine these tips with those: https://tech.fretlink.com/build-your-own-data-lake-for-repor... (I'm the writer of that linked article) and you get a really powerful, fully open-source, extensive data-warehouse solution for many real-life use cases (when your data doesn't exceed the 10^9 order of magnitude for number of rows).

Thanks Cedric for sharing your experience with using PG for data-warehousing <3


Postgres is great, but honestly, why not use something column-oriented that’s built and optimised for this?


Sure, why not -- I mean Redshift is a cluster of Postgres databases as well. What you don't get is scale -- so it's fine for a small data warehouse (in which case is it really a data warehouse...)


I am not sure I agree with the general idea that Postgres can't scale, or even--albeit a bit less strongly--that it is hard to scale. Even in 2008 people were running petabyte-scale warehouses using Postgres:

https://www.toolbox.com/tech/data-management/blogs/2-petabyt...

Since 2008, with improvements in parallel query execution (and numerous other improvements) in the core project, plus the availability of forks/extensions which abstract and/or modify various bits for improving scalability (see Citus and Timescale), it's never been easier to scale Postgres to some truly staggering heights.

While I wouldn't want to speak in absolutes, there are very few applications where I think Postgres wouldn't be a viable choice as a data warehouse.

Emphasis on warehouse as I wouldn't want to suggest Postgres as an ideal candidate to be a data lake. The difference between them for me being whether or not the data is structured/processed. Similar in definition to this article:

https://medium.com/@distillerytech/data-warehouse-vs-data-la...

Personally, I have experience scaling core PostgreSQL (9.4) to handle ingestion of monitoring data for web servers to the tune of 2-3 terabytes a day. Not the grandest of scales, but enough to have seen a few bumps along the way...and, for what it's worth, I think it is surprisingly easy to scale.

I wouldn't want to sign up to scale Postgres to handle exabyte data loads, but single digit petabytes? Sure.

https://techcommunity.microsoft.com/t5/azure-database-for-po...

And at petabyte-scale, I personally think it qualifies as a data warehouse.


How does this compare to other columnar stores of this kind, e.g. MariaDB ColumnStore or others?


No columnstore option like SQL server?


Citus, cited at the end, is a column store (single node or distributed)

https://github.com/citusdata/citus


None baked in (yet). Maybe in 15 or 16; Zedstore (a project for a PG columnstore table access method) is slowly getting more and more feature-complete, and might be committed to the main branch in one of the next few major releases.


I'm curious what people think of Swarm64 and/or TimescaleDB on that front


Probably going to get downvoted for this, but I feel like MongoDB should get more love in these subs.

We use it all the time and their aggregations can get really advanced and perform well, to the level where we run most analytics on demand. Sure, we're not pushing to “big data” levels, max a few 100k records, but in reality I believe that's the average size of the majority of business datasets (just an estimate, I have nothing to back this up)

Been building with it for 5 years now and it's been a breeze. Especially with Atlas. I think our team has not spent more that 3 days in total on DB dev ops. And with Atlas Lucene text search and data lakes, querying data from S3. What's not to love.


I have used Mongo and it is a total nightmare; some of the reasons are in your text. It's supposedly only good with Atlas, which is a managed service.

I haven't tried Atlas myself (and why would I? - I try to avoid lock-in), but since Postgres supports json column types, that has been my go-to instead of Mongo and it has been an absolute breeze. Especially since it can be indexed and scanned with Postgres SQL.


> Sure we're not pushing to “big data” levels, max a few 100k records, but in reality I believe that's the average size of the majority of business datasets (Just an estimate, I have nothing to back this up)

Excel on a laptop is also a viable option at that scale


In analytic workloads MongoDB gets absolutely eviscerated on performance, it is absolutely not optimized for OLAP. Obviously, you've got to go with what you know and it's great that it works for you (though at your data sizes, I'm pretty sure you could also perform quite well by just loading all the data into a vector in memory in your application, or maybe even put it in Excel as another comment mentions :)), but generally speaking this article is aimed at people whose data is already in Postgres or are familiar with it and want to leverage it for OLAP. If that's the case, putting that data in MongoDB and performing worse on basically all fronts isn't a very compelling idea.


Yes, if you're only going to have 100k records, then basically any type of store will work, and you should choose purely on ergonomics. At that size, you can get away with not even having indexes on a traditional sql database.


Mongo is nice to work with for applications, but for analytics it's a bit of a nightmare - mostly around non compatibility with external tools.


Yeah, for this blog it was optimizing a Postgres table with 100m records, so 1000x more, and thus all these issues came to be.


You are correct that most business datasets are small.

However, once you get to a 600 M row dataset, you will likely encounter significant performance/cost issues with a noSQL database.


Even banks/fintechs that aren't large often add more records than that daily.



