An unexpected journey, a Postgres DBA's tale

limaoscarjuliet · on Aug 2, 2016

There are more surprises around Postgres XID to be had:

- Temp tables cannot be vacuumed by anyone else than connection that owns the table (no way around it) - do not keep temp tables too long, nightly vacuum or auto vacuum will not clear these,

- Temp tables do not get automatically dropped on recovery if your DB cluster crashes. They get reused but if you have many connections before crash and few after some namespaces will linger - you may need to manually cascade drop temp namespaces if you see the warning,

- Full vacuum does not touch XIDs - use freeze or just normal vacuum,

- Check oldest XID per table:

  SELECT c.oid::regclass as table_name,

        greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age

        FROM pg_class c

        LEFT JOIN pg_class t ON c.reltoastrelid = t.oid

        WHERE c.relkind = 'r'

        ORDER by age DESC

        LIMIT %s

- Check oldest XID per db:

        select datname db, age(datfrozenxid) FROM pg_database ORDER BY age DESC

anarazel · on Aug 2, 2016

> - Full vacuum does not touch XIDs - use freeze or just normal vacuum,

Don't think that's right.

      copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
  			   bool *pSwapToastByContent, TransactionId *pFreezeXid,
  			   MultiXactId *pCutoffMulti)
  ...
  
  /*
  	 * Compute xids used to freeze and weed out dead tuples and multixacts.
  	 * Since we're going to rewrite the whole table anyway, there's no reason
  	 * not to be aggressive about this.
  	 */
  	vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
  						  &OldestXmin, &FreezeXid, NULL, &MultiXactCutoff,
  						  NULL);
  
  /*
  	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
  	 * backwards, so take the max.
  	 */
  	if (TransactionIdPrecedes(FreezeXid, OldHeap->rd_rel->relfrozenxid))
  		FreezeXid = OldHeap->rd_rel->relfrozenxid;

i.e. the cutoff is computed in copy_heap_data(). And then in

      void
      rewrite_heap_tuple(RewriteState state,
  				   HeapTuple old_tuple, HeapTuple new_tuple)
  {
  ...
  	/*
  	 * While we have our hands on the tuple, we may as well freeze any
  	 * eligible xmin or xmax, so that future VACUUM effort can be saved.
  	 */
  	heap_freeze_tuple(new_tuple->t_data, state->rs_freeze_xid,
  					  state->rs_cutoff_multi);

limaoscarjuliet · on Aug 3, 2016

I'm speaking from experience with 9.1, I perhaps this has changed since (5+ years of dev!). I ran an experiment to demonstrate no XID reset on vacuum full:

I ran a test:

XID age in callback 354015 (0% of max)

task=# vacuum full callback;

XID age in callback 354015 (0% of max)

task=# vacuum freeze callback;

XID age in callback 11 (0% of max)

anarazel · on Aug 3, 2016

Yea, but that's not the same as not freezing at all - it freezes stuff older than autovacuum_freeze_min_age.

the_mitsuhiko · on Aug 2, 2016

Are you sure the leftover temp table thing still exists? I have been using postgres for development on a project with temp tables and even though of being rough with postgres (killing, starting laptop in not so nice ways etc.) i can't find any leftovers.

limaoscarjuliet · on Aug 2, 2016

Leftovers will happen only when number of used temp namespaces before crash is larger than number used after crash. Then not all temp namespaces get reused and, as a result, reset after recovery. The leftovers then keep the XID around forever.

anarazel · on Aug 3, 2016

That doesn't sound right:

static void do_autovacuum(void) {

...

		if (classForm->relpersistence == RELPERSISTENCE_TEMP)
		{
			int			backendID;

			backendID = GetTempNamespaceBackendId(classForm->relnamespace);

			/* We just ignore it if the owning backend is still active */
			if (backendID == MyBackendId || BackendIdGetProc(backendID) == NULL)
			{
				/*
				 * We found an orphan temp table (which was probably left
				 * behind by a crashed backend).  If it's so old as to need
				 * vacuum for wraparound, forcibly drop it.  Otherwise just
				 * log a complaint.
				 */
				if (wraparound)
				{
					ObjectAddress object;

					ereport(LOG,
							(errmsg("autovacuum: dropping orphan temp table \"%s\".\"%s\" in database \"%s\"",
								 get_namespace_name(classForm->relnamespace),
									NameStr(classForm->relname),
									get_database_name(MyDatabaseId))));
					object.classId = RelationRelationId;
					object.objectId = relid;
					object.objectSubId = 0;
					performDeletion(&object, DROP_CASCADE, PERFORM_DELETION_INTERNAL);
				}
				else
				{
					ereport(LOG,
							(errmsg("autovacuum: found orphan temp table \"%s\".\"%s\" in database \"%s\"",
								 get_namespace_name(classForm->relnamespace),
									NameStr(classForm->relname),
									get_database_name(MyDatabaseId))));
				}
			}

limaoscarjuliet · on Aug 3, 2016

Aye, my stuff was from 9.1, seen it with my own eyes. I'm glad it improved since!

anarazel · on Aug 3, 2016

It's pretty much the same in 9.1.

radiowave · on Aug 2, 2016

At a slight tangent, I don't think I've ever used a temporary table without living to regret it.

- Queries that perform complex operations involving the creation of temp tables cannot be joined onto themselves[1].

- Operations that create temp tables can never be used as part of the definition of a materialized view[1]. This is a permissions issue, but not one that can be resolved by altering permissions. It's baked in.

- Confusion over how to test for the existance of a temp table, "it's in the catalog but it's not in my session".

[1] - Without resorting to dblink shenanigans, or such like.

mjevans · on Aug 2, 2016

Have you had better experiences when you modify the design to use an UNLOGGED table and update such a table within a transaction? (I assume most of the use cases are as a cache for data that actually exists elsewhere)

radiowave · on Aug 2, 2016

Yes. That's what I tend to end up replacing the temp tables with. It can be a bit fiddly to design, but I haven't had any operational problems arising from it.

In some of my cases, the temporary data is some kind of projection of future stock requirements, arising from a user doing a "what if" operation. Since multiple users could be doing this, the replacement unlogged non-temporary table needs to be "multi-tennant", which involves an extra column (usually called "invocation_id"), and a sequence, such that each invocation of the operation gets a new invocation_id value, which it uses in all of the rows it then generates. The extra complexity is ensuring the invocation_id gets passed around to all of the functions that are operating on the data, including the clean-up when we're done with it.

saurik · on Aug 2, 2016

It is my vague understanding that a lot of these XID vacuum issues are being massively improved by a patch that landed a few months ago.

http://rhaas.blogspot.com/2016/03/no-more-full-table-vacuums...

"No More Full-Table Vacuums" - Robert Haas

vinothgopi · on Aug 2, 2016

I don't think I’ve ever read a tech blog post that was so gripping, with twists and turns AND ends with a cliffhanger (until you click on the link to part 2)

the_duke · on Aug 2, 2016

Haha I agree.

For a second I was conserned about the level of my geekiness.

Then I moved on the the next HN article...

Femur · on Aug 2, 2016

I wonder is there a compelling reason that Postgres does not use 64-bit XIDs?

saurik · on Aug 2, 2016

There are four XIDs per tuple, so a 64-bit XID would make every row 32-bytes larger. (That's a lot larger.)

stickfigure · on Aug 3, 2016

...and yet still reasonable for many applications...

ivoras · on Aug 2, 2016

For the elucidation of the audience: one thing which I found out can kill PostgreSQL (and luckily there are few of those) is temp table over-use. Imagine a busy app where most page loads create and work on wide temp tables. What isn't directly obvious is that temp tables, like regular tables, have their metadata (attributes) stored in system catalogues, and the catalogues themselves apparently are very much (if not identical) like regular tables. So now, there's huge write traffic on system tables which by the nature of temp tables generates huge numbers of dead records, as information on every field of every temp table, etc. is created and deleted. I'm also pretty sure there's some ugly locking also in the mix.

So imagine now that autovacuum was turned off...

The user tables in such a system can survive if they are not updated often, with an occasional vacuum. However, the situation here was such that for every GB of user data, another GB has accumulated in dead records in system tables. Half of disk usage has gone to bloat in system tables!

It's easy to fix, of course, with a full vacuum of everything, if you have the time. Or more quickly, a full vacuum of system tables should be very fast, right? Right, except in that particular old version of PostgreSQL there was a race condition bug with vacuuming system tables which locks up the entire cluster with a possibility of data corruption for the db in question. Guess what happened.

A restore from backup.

whack · on Aug 2, 2016

Given the dangers of the XID wraparound, I wonder if it makes sense for Postgres to auto-launch a Vacuum Freeze, or some minimal variant of it, when it detects that a XID-wraparound is imminent.

pilif · on Aug 2, 2016

It does do that. But it runs at very low priority, so it might take a while.

But on the other hand, it starts warning you, first just in the log file, then on every query you issue way before it actually stops working, so if you're not ignoring warnings in your logs / your client-code, there should be ample time to react.

At least, there was for me when I first ran into this a few years ago.

hvo · on Aug 2, 2016

One of the main points of VACUUM is to protect you against XID-wraparound or multixact ID wraparound.So,your idea of auto-launch of a Vacuum Freeze is good but i still believe it is important to be aware that you may lose your very old data if you dont do anything about it.

hobs · on Aug 2, 2016

200,000 tables in one database? What?

jimktrains2 · on Aug 2, 2016

Image a case in which census data and the associated geometries. https://github.com/censusreporter/census-postgres has 22 surveys, each with 230+ tables. That's 5000+ tables right there. Now, the TIGER tables for all of that is another 50 tables per year, so another 350 tables.

If these were to be partitioned by state, instead of all records for all states in a single table, then we're looking at 270,000.

So, it's not _that_ difficult.

hobs · on Aug 2, 2016

Are the surveys so fundamentally different that I cant just store the 280+ tables with a uniqueifier for the 22 different surveys?

You want it by state, sounds like a column to add to my existing tables, not a way to make hundreds of new ones, or making a table with an FK and a state column.

logfromblammo · on Aug 2, 2016

Obviously, you have not often worked with government customers.

I can't even remember how many times I have explained what would be correct and maintainable, only to be told to do it exactly the same way the old system did it.

So if I were working with the census, I'd consider myself lucky to not have to build a GUI for poking the holes in virtual Hollerith punch cards.

dsugarman · on Aug 2, 2016

I worked with a small team (4 people) to rewrite the systems for determining GDP and other economic info for the US. We built an elegant system heavily utilizing dynamic SQL to keep the complexity in metadata and simplify the ridiculous amount of different ways information was collected and stored. In the end it went from taking days running ~20 COBOL programs in serial to around 20 minutes running one SQL script to get the results.

I understand this may not be typical of government contracting, but it is not fair to say that you can't make quality systems for government customers.

josho · on Aug 2, 2016

> only to be told to do it exactly the same way the old system did it.

Early in my career this pissed me off too. As I matured I learned that there were often valid reasons to do it the same way. E.g. External systems accessed the DB directly for reporting, special cases meant my proposed design would be fragile, etc.

To the younger folks here, when you encounter this realize constraints exist that you must workaround. If you feel strongly enough to push back then first seek out some senior resources to dig for a deeper understanding before going ahead and proposing an alternate.

logfromblammo · on Aug 2, 2016

The only reason you really need is that the customer wants it done that way.

As I grow older, I become less enthusiastic about protecting other people from the potential consequences of their decisions. At this point, I will still warn about those consequences, but only so that my company will get contracted again later, to do the thing I suggested in the first place.

  I: You could speed up this workflow by doing X.
  They: No... It needs to look exactly like this paper form.
  I: Your wish is my command.
  [3 months pass]
  They: We want you to change this form by doing almost exactly X.
  Boss: [holds out hand, rubs fingers together]
  They: [writes fat check]
  I: Your wish is my command.
  They: Just make sure that it still looks like the form when I print it.
  I: [dying a little inside] Your wish is stu... still my command.
  [3 months pass]
  They: The printed forms are crap.  We want you to export the data to an Excel spreadsheet.
  Boss: [holds out hand, rubs fingers together]
  They: [writes fat check]
  I: Your wish is my command.
  [continue in this fashion until your death]

hinkley · on Aug 2, 2016

Problems that nobody talks about never get fixed. If you give up the there is no feedback and no news is good news.

And one of the worst technical failures I worked on was a project where they insisted on reports from live data, and put little bits of report data in every response from the server.

They couldn't grasp why the performance was so underwhelming. Every index in your tables adds time to every insert and update operation, and often slightly stale data is close enough for most users and can be pre-calculated on an interval.

If you get the bright idea that measuring how many people are online or exactly how many dollars you made this month, then you will have fewer users and make less money because of it.

Use reporting tables. Most people shouldn't be writing into your data anyway and if they are, run away. Make the reporting tables your external API, and then you can add or modify columns, split or combine tables whenever you want.

Roboprog · on Aug 2, 2016

Hah. I have been working for UC for a while on a project revamp for the state (CA). Make the table and the screen look just like form # xxx ...

You got it :-)

phonon · on Aug 2, 2016

Heh--go all the way! http://www.codeproject.com/Articles/466362/Blend-PDF-with-HT...

jimktrains2 · on Aug 2, 2016

The surveys can change over time.

Also, the state is part of the identifier already, however, you may want to pull it out into tables for a variety of reasons.

dmuth · on Aug 2, 2016

What indeed. Having that many tables in a database is just asking for trouble.

My gut feeling for a database of this size is to suggest sharding. While I don't claim to be an expert in this sort of thing, here's how Pintrest approached a similar problem: https://engineering.pinterest.com/blog/sharding-pinterest-ho...

barrkel · on Aug 2, 2016

If you have a lot of variable schema datasets that you need to be able to index and query, you can do a lot worse than create a table per data set.

At my job, in our largest environment we only have 30,000 tables in our biggest database, but that's only a year or two's worth of data and the rate of increase is increasing.

hobs · on Aug 2, 2016

Interesting, without giving anything proprietary away, how large are they in terms of gb/rows/column count?

barrkel · on Aug 2, 2016

It entirely depends on the user's data. 2,000 to 7 million rows, with anything from three columns (rare) to 200+, 80% range would probably be 20 to 60 columns. Raw on-disk sizes can be anywhere from tens of GB to probably TB level - I don't track operations that closely any more.

We (du.co) make a reconciliation product for finance. Reconciliation is basically comparing two lists. The size of the data entirely depends on what the customer wants to compare.

overcast · on Aug 2, 2016

Can you give me an example of what you're doing exactly? Rapidly expanding the number of tables in a database doesn't make sense. Sounds like you need a flexible document store, and not a relational database.

barrkel · on Aug 2, 2016

Have looked at what's on the market, and the performance characteristics of what's available isn't as good as a relational DB, in terms of queries that join the rest of our schema, static data, only a handful of warm tables but quick ramp-up time, etc. Specifically we don't want a cluster solution (too expensive in terms of hardware) and our data is almost completely static once it's loaded (so we don't need something that e.g. keeps stuff in memory to optimize updates).

The closest is RethinkDB, but its memory requirements per table are too high.

If you have other suggestions for something that can live on a smallish box (4G) yet support thousands of tables, fairly arbitrary queries, and joins against a relational DB, I'm open to suggestions. Fast bulk updates and inserts are a bonus; a relational DML is a requirement, since a good chunk of our logic is in SQL, and much more readable for it.

overcast · on Aug 2, 2016

I was going to suggest RethinkDB, I've used it for tons of projects, and it's my current first go to DB. Combines the flexibility of a document store, with relational queries on them. Pretty awesome. Are the memory requirements that much more than having tens of thousands of tables open?

marcosdumay · on Aug 2, 2016

The idea of a flexible document store is not using thousands of tables. Instead, you get a flexible "table" that can store a big number of different data types.

For some extent, Postgres can do that. But I don't know the extent of flexibility/performance you'll need.

barrkel · on Aug 2, 2016

We have 10s to 100s of thousands of tables - literally, tables, with rows and columns, each column of a consistent type, that need to be sortable and filterable - with each table having low thousands to low millions of rows.

When browsing any one table, the entire table's contents - more or less - needs to be sortable on any column and served up in a paginated fashion.

A single table with heterogeneous data, with lookup by ID or some simple key, is quite far away from what we need. What we need looks almost exactly like an RDBMS table. It even has foreign keys into tables that are much more conventional.

gaius · on Aug 2, 2016

Agreed, RDBMSs are not really intended for this use case. I'd be tempted to just store the data as blobs in S3 in this scenario, since they're on RDS anyway.

barrkel · on Aug 2, 2016

Finance: need separate virtual machines, virtual LAN etc. per customer at a minimum. No third-party cloud. Store needs to run fairly efficiently on small boxes to make the economics work.

(FWIW, RDBMS has worked really well for this use case, for us. Having a fully supported relational language, efficient temporary tables that can spill out to disk without going into thrashing hell, a system designed to keep only the tip of the iceberg in memory, etc. can be useful.)

overcast · on Aug 2, 2016

While I won't argue that you need some type of relational tables, I wouldn't consider creating thirty-thousand of them working really well. I'd be curious as to how RethinkDB would work for you.

barrkel · on Aug 2, 2016

What evidence would you require for proof that it's working well? We have the evidence we need already: a working solution well within its hardware constraints and performing excellently. There aren't many operations that scale worse than linearly in the number of tables (primarily schema queries), and not many operations are affected at all. The biggest thing to remember is to turn off the background schema queries your database client might try and do.

(We are, however, using MySQL, not Postgres, primarily because of the lack of a decent replication solution at the time the app was built.)

overcast · on Aug 2, 2016

I don't really need any evidence, other than knowing someone is managing 30,000 tables, or worse, 200,000. I'm glad it's working for you, it just sounds like a complete nightmare to manage.

sciurus · on Aug 2, 2016

As this post demonstrates, once you have a setup like that you'd really benefit from having someone on staff (or on retainer as a contractor) whose primary skill set is postgresql adminstration.

falcolas · on Aug 2, 2016

Or just DB administration in general. Even a basic understanding of DB theory and how that theory crashes with reality can be invaluable.

eddd · on Aug 2, 2016

I lifted both of my eyebrows while reading that sentence.

zwily · on Aug 2, 2016

Imagine your app uses 500 tables, and you use a separate schema per tenant on a single database. With 400 tenants, you now have 200,000 tables.

hobs · on Aug 3, 2016

Oh I understood that completely, but you probably dont want to make one giant database for all your customers unless you really love engineering the crap out of things.

audleman · on Aug 2, 2016

This article mirrors my own experience with Postgres. When something goes wrong, furiously research for 10 days, try a bunch of complex things, then discover a simple command that gracefully fixes everything =)

crb002 · on Aug 2, 2016

Or just change XIDs to 64 bit...

colanderman · on Aug 2, 2016

An ironic comment, given yesterday's popular article: https://news.ycombinator.com/item?id=12202735

approachingtraj · on Aug 2, 2016

The title is misleading. The post only talks about Amazon's RDS which is not Postgres.

vinothgopi · on Aug 2, 2016

Amazon RDS is the managed version of Postgres (among other flavors of relational DBs). Maybe you are confusing it with Redshift.

https://aws.amazon.com/rds/

rplst8 · on Aug 2, 2016

Isn't Redshift also based on PostgreSQL? Edit: it is: https://en.wikipedia.org/wiki/Amazon_Redshift and RDS supports many different RDBMS flavors: https://en.wikipedia.org/wiki/Amazon_Relational_Database_Ser...

heneryville · on Aug 2, 2016

Redshift is based on PestgreSQL, but is vastly different. Redshift is a columnar store, and can't do things like create indexes. http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_commands...