
Rarely Used Postgres Datatypes - narfz
http://www.craigkerstiens.com/2014/05/07/Postgres-datatypes-the-ones-youre-not-using/
======
zrail
One of the coolest things about Postgres is how easy it is to make your _own_
datatypes[1]. At previous jobs we've had custom types and aggregates that
allowed super fast set operations with comparatively small data storage
requirements.

[1]: [http://www.postgresql.org/docs/9.3/static/sql-createtype.html](http://www.postgresql.org/docs/9.3/static/sql-createtype.html)
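A minimal sketch of the idea (the type, function, and aggregate names here are invented for illustration):

    -- A composite type: a point-in-time reading from some sensor.
    CREATE TYPE reading AS (
        taken_at  timestamptz,
        value     numeric
    );

    -- Custom aggregates are built from a state-transition function;
    -- this one keeps whichever reading is most recent.
    CREATE FUNCTION latest_reading(reading, reading) RETURNS reading AS $$
        SELECT CASE WHEN $1 IS NULL OR $2.taken_at > $1.taken_at
                    THEN $2 ELSE $1 END;
    $$ LANGUAGE sql IMMUTABLE;

    CREATE AGGREGATE latest(reading) (
        SFUNC = latest_reading,
        STYPE = reading
    );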

~~~
callesgg
Yup, I recommend creating [Postgres domains] for everything that isn't already a specific built-in Postgres type.

Want to store an order id? Create an OrderId domain. Want to store a weight? Create a gram domain. Want to store a quantity? Create a quantity domain.

That way it becomes much easier to spot logical errors and maintain data consistency.

One can also use it to build something like an automatic ORM.

[A domain is like a user-defined child type of the system types that can have some extra restrictions.]
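A sketch of what that looks like (the domain names are purely illustrative):

    CREATE DOMAIN order_id AS bigint  CHECK (VALUE > 0);
    CREATE DOMAIN gram     AS numeric CHECK (VALUE >= 0);
    CREATE DOMAIN quantity AS integer CHECK (VALUE >= 0);

    CREATE TABLE order_lines (
        id     order_id,
        weight gram,
        qty    quantity
    );

    -- Rejected by the quantity domain's check:
    -- INSERT INTO order_lines VALUES (1, 10.5, -3);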

~~~
NoMoreNicksLeft
They already have a quantity domain... it's called "integer". While I agree
that domains are important and should be utilized, I hope your examples are
exaggerations.

~~~
ericcholis
I love the idea of domains, but quantity does seem to be a poor use case.

~~~
joevandyk
You could use a domain to ensure a quantity can't be less than zero.

~~~
NoMoreNicksLeft
This is also called "unsigned int". Postgres has those.

------
mahmud
If your ORM doesn't let you take advantage of these goodies, maybe it's time
you went beyond the ORM? Java's jOOQ library is stellar at this kind of raw SQL
access, while still being expressive and well-typed:

[http://www.jooq.org/](http://www.jooq.org/)

~~~
ebiester
How much is it?

"If you have to ask, you can't afford it."

~~~
hueho
I was surprised to see jOOQ went commercial. It's still open source, though. The
paid versions apparently just include direct support, a warranty, and
out-of-the-box support for Big Corp databases.

[http://www.jooq.org/licensing](http://www.jooq.org/licensing)

~~~
mahmud
What is wrong with paying for software? It's a very reasonable license, too:
cost-free for open-source databases, just without immediate support. This is
classic Free Software licensing.

~~~
saryant
Slick, a DB access library for Scala, follows the same license pattern.

[http://slick.typesafe.com/doc/2.0.1/extensions.html](http://slick.typesafe.com/doc/2.0.1/extensions.html)

~~~
lukaseder
Although, unlike us, Typesafe is completely opaque about pricing for both
licensing and support subscriptions. Have you asked how much it will cost to go
to production with Slick and an Oracle database? You'll be in for a surprise!

With jOOQ's transparent licensing strategy there are no strings attached.

------
wlievens
Let's certainly not forget the awesome PostGIS extension with its spatial
types and functions.
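For the unfamiliar, a tiny taste of the spatial types (assuming PostGIS is installed; the table and coordinates are invented):

    CREATE TABLE shops (
        name     text,
        location geography(Point, 4326)
    );

    -- Shops within 1 km of a point; ST_DWithin on geography uses meters.
    SELECT name
    FROM shops
    WHERE ST_DWithin(location,
                     ST_MakePoint(-122.42, 37.77)::geography,
                     1000);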

~~~
craigkerstiens
PostGIS absolutely deserves some attention, as it's one of, if not the, best GIS
databases out there. It felt too big to include here, but I should at least add
a small callout.

~~~
heydenberk
Thanks for the article! I started using postgres because of one of your slide
decks and I've learned a lot from your writing over the last year or so.

------
radiowave
Hstore, and the supplied functions for converting to/from records, have been
very useful for making a "one size fits all" history table.

Currently I'm working with numeric range types to model order quantities and
stock quantities, and forecasting which order is going to consume what stock
by joining on rows with overlapping ranges, then taking the intersections of
those overlaps. Again, Postgres supplies functions and operators for range
overlaps, intersections, etc.

In the absence of those datatypes, there'd be a lot more work required to
achieve either of these.
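The range part of that can be sketched like so (schema invented for illustration; `&&` is the range-overlap operator and `*` is range intersection):

    CREATE TABLE stock  (batch    text, qty numrange);
    CREATE TABLE orders (order_no int,  qty numrange);

    SELECT o.order_no,
           s.batch,
           o.qty * s.qty AS consumed   -- intersection of the overlap
    FROM orders o
    JOIN stock  s ON o.qty && s.qty;   -- rows with overlapping ranges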

------
apinstein
The ltree module is really awesome. If you have to do any hierarchical stuff, I
feel it's much better than nested-set implementations. Great performance with
GiST indexes, and high-level operators for querying.
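A small sketch of ltree in action (paths invented):

    CREATE EXTENSION ltree;

    CREATE TABLE categories (path ltree);
    CREATE INDEX categories_path_idx ON categories USING gist (path);

    INSERT INTO categories VALUES
        ('electronics'),
        ('electronics.cameras'),
        ('electronics.cameras.lenses');

    -- <@ means "is a descendant of (or equal to)":
    SELECT path FROM categories WHERE path <@ 'electronics';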

------
joevandyk
The ip4/ip4r data types are fantastic as well if you just need to store an IP
address.

([https://github.com/RhodiumToad/ip4r-historical/blob/master/R...](https://github.com/RhodiumToad/ip4r-historical/blob/master/README.ip4r))

~~~
craigkerstiens
Joe, this is definitely an awesome datatype; what's painful is building it, or
at least it was years ago when I dealt with it. I'm sure it's gotten better,
but still, I'd love to see this packaged up as a contrib module just like
hstore.

~~~
jeltz
If you run Debian (or a derivative) you can install it using apt-get from
apt.postgresql.org, and then just run "CREATE EXTENSION ip4r;" to install it
into a specific database.

[http://wiki.postgresql.org/wiki/Apt](http://wiki.postgresql.org/wiki/Apt)

------
yawboakye
Why aren't they getting a lot of use? Is it because they're new, or because
ORMs don't support them yet? (Note: this might sound like a false dichotomy,
but these are the only two reasons I know of, so feel free to add more.)

~~~
craigkerstiens
Hstore is definitely an odd one in that it's an extension, so it's sort of
core, sort of not. For the others, it's likely just that people default to what
the ORM does. Hstore actually sees 13% usage across production databases on
Heroku, so while that's something, I'd still consider it much lower than it
probably should be.

------
troyk
I keep seeing the uuid idea for pkeys, but have yet to see them used by any
big site. Last I looked, Twitter and Facebook are using bigints all over the
place. Also, Instagram went bigint and blogged about it here:
[http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram](http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram)

Also, I tried the uuid extension once... it is not well supported; I had to
make a small change to a C header file to get it to compile on Ubuntu, and for
dev on OS X I think I gave up.

~~~
mantrax5
UUIDs aren't sortable by order of creation (on a per-server basis, which is
useful in distributed algorithms), and while you'll hear a lot of talk about
meteors, lightning strikes, and winning the lottery, _you do risk_ collisions
(yes, when it comes to thousands of rps per server, millions of servers, and
extended periods of time, you reach ridiculous collision rates even on a
128-bit UUID, thanks to the birthday paradox).

UUIDs have no guaranteed properties you can rely on at high volumes.

The only desirable property they have is being an easy way out in tutorials.

The better solution is simply longer to explain (have a unique name for each
server, an incremental generation number every time the server is restarted,
and an incremental number on top of that for the runtime of a given generation
- the way Erlang does it).

~~~
benjiweber
Wikipedia claims the birthday paradox means

"only after generating 1 billion UUIDs every second for the next 100 years,
the probability of creating just one duplicate would be about 50%."

[http://en.wikipedia.org/wiki/Universally_unique_identifier#R...](http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates)
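As a back-of-the-envelope check: a version 4 UUID has 122 random bits, and the standard birthday approximation is p ≈ 1 - exp(-n^2 / 2N). Using Postgres purely as a calculator:

    -- n = one billion UUIDs per second, for a given number of years.
    SELECT years,
           1 - exp(-power(1e9 * 3600 * 24 * 365 * years, 2)
                   / (2 * power(2::float8, 122))) AS p_duplicate
    FROM generate_series(60, 100, 20) AS years;

The probability crosses 50% somewhere between 80 and 100 years, so the quoted figure is in the right ballpark.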

~~~
mantrax5
That's not as much as you might think in computer systems.

Now, if you want to mark your product SKU with a UUID, you have an outstanding
chance it'll be unique among other UUID SKUs in the scope of every shop
selling your SKUs.

But people have been captivated by the idea that a UUID is truly "universally
unique", so you can use it for everything, at any volume.

Not at all. Let's say that we're implementing a distributed actor system (or a
similar message-passing architecture) at the scale of Google. We'll be using a
UUID to tag every message, to guarantee its identity and various messaging
properties (like deliver-exactly-once, etc.), because UUIDs are unique. While
actor systems can reach millions of messages per server per second, here we'll
use a humble 100k messages per second.

\- They have over 2 million servers (2,000,000)

\- Each of which generates 100k messages per second.

You reach a point of 50% collision rate ( _for every new UUID_ ) in 3 years:

    
    
    log2(2,000,000 * 100,000 * 3600 * 24 * 365 * 3) ≈ 64

2^64 is roughly the number of 128-bit UUIDs you need to generate to reach that
collision rate due to the birthday paradox.

Now you might say I have a lot of variables pegged to the worst-case scenario.
Sure I have. But this is about a 50 damn percent chance of collision on _every
new UUID generated_ at that point.

Things will become bad long before a 50% collision chance if you use UUIDs.

~~~
Someone
_" But this is about 50 damn percent chance of collision on every new UUID
generated at that point."_

That's incorrect. If you go back to the birthday paradox, what you say would
mean that, if you have 23 people in a room, and a 24th walks in, there is a
50% chance that that new person shares their birthday with one of those 23.
That clearly is incorrect; that probability is at most 23/365 < 0.10.

Also, I don't see how your formula proves it. You will have generated 2^64 of
the 2^128 possible values in 3 years, but that means you only got through a
2^-64 fraction of the key space, so each next UUID has a chance of only about
2^-64 of a collision.

It's the sheer number of lottery draws, with ever-increasing probability of a
loss, that produces the birthday paradox, not the last draw.

~~~
mantrax5
Indeed, thanks for clarifying my overly dramatic point with a more proper
definition of the birthday paradox.

Yet, even once corrected, a 50% chance of a collision in a set of 2^64 UUIDs
(which can occur way earlier than 2^64) is way too much for me to sleep at
night, knowing there are more efficient, smaller (often half the size, e.g. a
64-bit PK instead of a 128-bit PK), guaranteed collision-free ways of producing
PKs.

And there's another problem with UUID collision rates. Many UUID versions rely
on a PRNG, and PRNG quality varies wildly from system to system.

A defect in the PRNG (it has happened) can start producing UUIDs with
orders-of-magnitude higher collision rates than the model predicts. It's a
problem that's better not to have.

~~~
andrewaylett
While I know it's out of favour, you don't have to create version 4 (random)
UUIDs. Including a machine identifier and timestamp in the generation of your
UUID guarantees uniqueness, so long as each machine is careful not to generate
two UUIDs 'simultaneously'.

This may bring issues in untrusted contexts, but it doesn't involve global (or
indeed any) synchronisation outside of each machine, and you may define your
'machine' for UUID purposes to be any size you like -- one per core may work,
or one per application, or per application thread even.

~~~
mantrax5
> "Including a machine identifier and timestamp in the generation of your UUID
> guarantees uniqueness, so long as each machine is careful not to generate
> two UUIDs 'simultaneously'."

That's a pretty big "if" there, especially considering how unreliable timers
are, and the fact that they can jump back and forth.

\- Wall clock timers will jump forward or back and repeat time after NTP
adjustment, or anyone else who adjusts the clock.

\- Internal "elapsed time" timers, based on, say, the CPU clock, may jump
around or repeat when the source core changes on multicore machines, or when
processes source their CPU clock from the core they're bound to.

\- A node may be moved from machine to machine (and their clocks are not
necessarily perfectly synched).

This is a hard lesson that distributed-system designers learn over and over
again: don't rely on randomness and don't rely on timers for identity. Good old
incrementing counters have none of those issues and are absolutely trivial to
implement. They never fail, never drift, never repeat, by definition. As I
said a few posts above, the typical structure of a "unique id" built on
counters has three parts:

    
    
    1. nodename (or machine name if you will)

    2. namespace (you can have one per thread to avoid contention issues)

    3. local (monotonically increases within the namespace)
    

Let's say your node id is 256, and it runs on 4 threads, one per core, plus 1
thread for the scheduler. You need five namespaces:

256.0-4.* (where * is increasing monotonically 0, 1, 2, 3, 4...)

And once you have to reboot the node, you obtain five new namespaces:

256.5-9.*

And the * counters start over from zero on each. As for how big each segment
should be, that depends on what you want to do with the nodes, I suppose. But
the 128-bit UUID size should never be some kind of guide; maybe you need three
shorts, maybe three longs, maybe short.medium.long, etc. It'll depend on the
use case.
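In Postgres terms, a sequence already gives you the monotonic "local" part; a sketch of packing nodename.namespace.local into one bigint key (all names and field widths invented):

    CREATE SEQUENCE local_seq;

    -- 10 bits of node id, 10 bits of namespace/generation,
    -- 43 bits of monotonically increasing counter.
    CREATE FUNCTION next_id(node int, namespace int) RETURNS bigint AS $$
        SELECT (node::bigint      << 53)
             | (namespace::bigint << 43)
             | (nextval('local_seq') & ((1::bigint << 43) - 1));
    $$ LANGUAGE sql;

    SELECT next_id(256, 5);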

------
bwooce
The only bit I'd prefer clarification on is:

 _Timestamp with Timezone_

 _There’s seldom a case you shouldn’t be using these._

For recording most event times that's true, but:

Birthdates (which can include times) shouldn't use them.

Calendaring should think very carefully too: should the 9am meeting move to
10am due to DST, or should it "stay put" but potentially move for other
timezones?
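In Postgres terms that suggests (a sketch): a plain date for birthdates, and a local time plus a named timezone for recurring meetings, so the DST behaviour becomes an explicit decision rather than an accident:

    CREATE TABLE people (
        name text,
        born date            -- a birthday is a calendar date, not an instant
    );

    CREATE TABLE meetings (
        title    text,
        starts   time,       -- local wall-clock time, e.g. '09:00'
        timezone text        -- e.g. 'Pacific/Auckland'; resolve to an
                             -- absolute instant only when needed
    );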

~~~
ragsagar
_Birthdates (which can include times) shouldn't use them._

I don't understand why you don't want to use a timezone with birthdates (with
time). Can you explain, please?

~~~
exogen
I think because the common usage of birthdates doesn't take the timezone into
account. E.g. if you were born on January 1st at 9pm PST, that's January 2nd at
12am EST - but your birthday is still considered January 1st no matter where
you are in the world. So trying to correct such dates for timezones will show
users an incorrect birthdate.

------
groupmuse
Kind of confused here: I was under the impression that timestamp with timezone
only converts to and from your session timezone, but still always stores in
UTC. Is that incorrect?

~~~
saurik
You are correct.
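A quick way to see it (same stored instant, different rendering):

    CREATE TABLE t (ts timestamptz);
    INSERT INTO t VALUES ('2014-05-07 12:00:00+00');

    SET timezone = 'UTC';
    SELECT ts FROM t;        -- 2014-05-07 12:00:00+00

    SET timezone = 'US/Pacific';
    SELECT ts FROM t;        -- 2014-05-07 05:00:00-07, same instant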

------
malkia
I wish arrays and hstore-like types were available in SQLite (not through
plugins, but as part of the standard build).

I've used something else from my experiments in PostgreSQL - the TXID. With it
I was able to track down changes in the database: keep the previous TXID (and
some other bits), then poll again (or have the server call me) and ask only
for the rows that have changed since my last TXID.

------
Alex3917
The array type is actually one of the most interesting, I'd love to see the
performance implications of those specifically covered in more detail.

------
Ixiaus
Postgres is a great database, I'm particularly in love with hstore for the
data that it makes sense for.

------
dholowiski
While I don't disagree with the coolness of uuid, how exactly do you 'run out
of integers'?

~~~
dmm
Here's a good example:

[http://slashdot.org/story/06/11/09/1534204/slashdot-posting-bug-infuriates-haggard-admins](http://slashdot.org/story/06/11/09/1534204/slashdot-posting-bug-infuriates-haggard-admins)

Slashdot got over 16,777,216 comments, overflowing the unsigned MySQL
mediumint they were using for an index.
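Postgres fails the same way when a 4-byte int column hits its ceiling; a quick sketch of the failure mode and the fix:

    -- int4 tops out at 2^31 - 1:
    SELECT 2147483647::int4 + 1;      -- ERROR: integer out of range

    -- bigint raises the ceiling to 2^63 - 1:
    SELECT 2147483647::bigint + 1;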

~~~
daigoba66
Same thing happened to Twitter: [http://techcrunch.com/2009/06/12/all-hell-may-break-loose-on-twitter-in-2-hours/](http://techcrunch.com/2009/06/12/all-hell-may-break-loose-on-twitter-in-2-hours/)

------
troyk
I think the money type is awesome! Under the hood it's a plain 64-bit integer,
so it generally takes less space than numeric.

So if you're only dealing with US currency, why not love the money type?

~~~
rpedela
Postgres developers disagree and recommend numeric. It was once deprecated and
then un-deprecated.

[http://www.postgresql.org/message-id/b42b73150805270629h309f3704i72c259081d8d393d@mail.gmail.com](http://www.postgresql.org/message-id/b42b73150805270629h309f3704i72c259081d8d393d@mail.gmail.com)

~~~
troyk
I don't know; I followed the thread and it didn't seem conclusive, especially
with the main critic stating "OTOH, it's a neat example of a fixed precision
integer type".

I use it a lot, and yes, it is missing some casts, but you can cast to decimal
and go from there:

    select '42.00'::money::decimal;

I've also used Stripe, and maybe it's a JavaScript thing, but they represent
all their money in cents. Personally, I'd rather use the money type; it's
easier for me to reason about.

~~~
sitharus
JavaScript represents all numbers as IEEE 754 floating point, and base-2
floating point isn't good at representing base-10 decimals.

------
frankpinto
Yes, amazing. I'm about to work on stats caching, and the range datatypes will
be suuuper useful for ranges of timestamps when caching a stat every week,
month, etc.

