Arrays in Postgres (craigkerstiens.com)
70 points by paratrooper on Aug 22, 2012 | 39 comments



One of the differences with Postgres is that it's OK to use interesting data types. Other systems treat doing so as though it were somehow wrong.

See my post here: http://thoughts.davisjeff.com/2009/09/30/choosing-data-types...

And no, using arrays is not an automatic violation of first normal form.


> And no, using arrays is not an automatic violation of first normal form.

One of the conditions of 1NF is that each row and column contain one and only one value. While a NULL might be disputable, using multiple values as shown in the two examples (tags, items/price/quantity) is a clear violation of 1NF. Only if the array data were truly "one value" (e.g., a vector) would it not be a violation. (I.e., I am not trying to contradict your statement, but I think it helps to add a bit of clarification.)


If you never need to query 'inside' the array, then I don't think that's a violation of 1NF at all. If you treat the entire array as an atomic value, then I would think that's still 1NF.


Even if you do need to query inside the array, how is this different from something like WHERE x like 'prefix%'? Is that a violation of 1NF too?


Arguably, yes, because you haven't really broken your data down as much as you need to; but this is a fairly contentious issue - just see http://en.wikipedia.org/wiki/First_normal_form#Atomicity. If you frequently need to query inside the data, you may well gain from moving that 'prefix' into a separate column, or, more drastically, rethinking your data model if that's appropriate.


In this example prefix is something the user typed in. So textcolumn LIKE '%blahblah%'.


"One of the conditions of the 1NF is that each row and column contain one and only one value."

There is no useful differentiation between atomic and non-atomic values. A timestamp is made up of a date and a time, a string is an array of characters, a real number has a sign and magnitude, etc.

I suppose you could use only booleans, which could be regarded as "atomic".


Arrays aren't really in the spirit of relational design. This means that using them will change the ACID characteristics compared to a relational design. In the article's first example, it would seem like adding another item to a purchase would cause the entire purchase row to lock - this wouldn't happen if items were stored in their own table.

That's not to say they aren't useful though; read performance would be much better with all the data living in a single row somewhere on disk. SQL queries would also be easier to write.


It's a slightly contrived example, because in a real-world schema you would want enough information about each line item that an array isn't practical. But I don't think locks are an issue here, because no sane business process would need different transactions that were updating the same purchase's line items in conflicting ways.


"Arrays aren't really in the spirit of relational design."

Why do you say that? To a certain extent I agree in broad terms; but I don't believe that it is somehow "unclean" or un-relational just because you are using an array (or any other complex data type).

"This means that using them will change the ACID characteristics compared to a relational design."

No, arrays are protected by ACID as is any other data in postgres.

"it would seem like adding another item to a purchase would cause the entire purchase row to lock"

It will still allow concurrent reads of the row in postgres (MVCC). In order for the write lock to be a practical problem, there would have to be a lot of concurrent updates to the very same purchase.

I think we can all agree that the first example is "hacky" (which the author says in the article), so let's ignore that one. How about the second example? I think it's pretty reasonable to store tags that way, if for no other reason than it could save a lot of space.
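
For concreteness, here's a minimal sketch of the tags-as-an-array approach (table and column names are illustrative, not necessarily the article's):

    CREATE TABLE products (
        id    serial PRIMARY KEY,
        name  text,
        tags  text[]
    );

    -- '@>' is the array "contains" operator: which products carry a given tag?
    SELECT name FROM products WHERE tags @> ARRAY['featured'];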


I assume by complex data type you mean something like (say) complex numbers, IP addresses, etc. The key difference is that the components of a complex data type don't stand on their own, or it's a violation of 1NF. I wouldn't mind storing a complex number in one column instead of two largely because I'd rather math with complex numbers "just work" and I'm unlikely to go updating the database for just the imaginary part. Likewise with IP addresses, it is useful to query them with netmasks and the like, but you're unlikely to query or update some middle component of them.

The interesting thing about arrays is that, depending on the work you're doing with them, they could either be a violation of 1NF or not. For example, if you're storing vectors in the physics sense, you may not be querying or updating their contents, in which case it's not a violation of 1NF because it's an atomic value. The danger of encouraging people to use Postgres's arrays is that a fair number of the people reading are developers who will take this as carte blanche to store repeated values or sets as arrays, and miss out on a bunch of the power and elegance of relational databases (and probably cause themselves a lot of pain dealing with a type that has no direct and helpful JDBC mapping, for example).

> if for no other reason than it could save a lot of space

I discourage people from worrying too much up-front about the space utilization of their database design for several reasons. Intuitions about database storage and performance tend to be incorrect. Indexes usually matter a lot more in terms of disk use than the exact types of the columns. It's also a bad idea to spend a lot of energy optimizing the database before you have a significant amount of real-world data. Query performance alone varies dramatically between fake and real-world data sets, in part because the query planner will choose different strategies based on the sizes of tables and indexes and also because you usually don't know ahead of time what your access patterns will look like.

When it comes to databases, there really aren't very many hard and fast rules compared to rules of thumb. The rule of thumb on arrays is don't use them, and there are good reasons for that. It could be that you see massive performance and disk savings with tags and the loss of relational correctness doesn't bite you in a particular application, but you should investigate before choosing. All optimizations are trading flexibility for performance. You should make sure you don't need the flexibility and you do need the performance before making the trade.


"The key difference is that the components of a complex data type don't stand on their own, or it's a violation of 1NF."

The characters or substrings that make up a string stand on their own. In fact, it's pretty hard to differentiate a string from an array of characters.

Similarly, the date component of a timestamp stands on its own, and timestamps are routinely broken up in that fashion.

"The interesting thing about arrays is depending on the work you're doing with them, they could either be a violation of 1NF or not."

In other words, you're saying that queries are a violation of 1NF, rather than the schema?

What about integers and modulo division? Is modulo division considered to be accessing a component of an integer?

"people reading are developers and will take this as carte blanch to store repeated values or sets as arrays"

There's no substitute for careful thought when designing a schema.

"Intuitions about database storage and performance tend to be incorrect."

Good point.

"loss of relational correctness"

I still don't think you've justified this. Remember that the normal forms are formally defined, so you should be able to show a clear distinction, which I've not seen yet.

"All optimizations are trading flexibility for performance."

What particular kind of flexibility is lost in this example (product tags)?

"The rule of thumb on arrays is don't use them, and there are good reasons for that."

This really sums up the situation. It's easy to use arrays for the wrong things, so the rule of thumb is to never use them. But they are still useful often enough for developers to get frustrated, and the rule of thumb may cause them to invent a worse solution elsewhere.

We should make the good reasons known, rather than just the rule of thumb.


"The characters or substrings that make up a string stand on their own. In fact, it's pretty hard to differentiate a string from an array of characters."

It's difficult to differentiate a number from an ordered n-tuple of bits, but that doesn't mean storing the number 123 amounts to storing an array.

I took the parent post's point to be clear: it isn't whether you can look at something as, in some sense, an array, it's whether you treat it as an atomic value. That's generally what you do with numbers: you update the whole number, not the third bit. And it's likely what you do with strings---even if you just want to change one character, you update the whole string, not the third character.

Likewise, if you want to store a complex number as a 2-tuple, you could treat that as an atomic data type even though it consists of two elements by always operating on the pair*. That's the respect in which it "stands on its own"---you don't peer inside.


I think you're getting hung up on the low-level details and missing the big relational picture.

Timestamps are atomic units. If I add 60 minutes to 10:15 AM I don't want to see 10:75 AM, I want to see 11:15 AM. This is because the "components" of the timestamp are not independent values. Likewise, a string may be an array of characters, but a name is an atomic entity. It's meaningful to break a string into characters (low-level) but it's not meaningful to break a name into name-characters.

Now, if I had a reason to make a relationship between one entity and many characters, the correct way to model it in the relational database is to use a one-to-many with a character column. Using a string instead is not the right thing to do. Strings, for starters, have an ordering, and sets do not. Indexing that string column probably won't perform as well as indexing the separate table, because string indexes care about that order. It may perform worse, or model the situation worse, for a variety of reasons. In practice, I've never seen anyone make a relationship between a character and an entity.

The distinction between when I'd use a separate row for each character and when I'd use a string is exactly the distinction between when using an array is 1NF and when it isn't. I don't think the idea of avoiding repeating groups says anything concrete about data types. I'd be surprised if the theorists spent a great deal of time discussing the philosophical ramifications of strings being composed of characters or what each type means on-disk; at least 40% of every book by Joe Celko consists of admonishment to avoid worrying about the physical layout. I'm sorry if this comment isn't sufficiently formal for you, but it's the best you're going to get from me. Lord knows this won't be the first time database theory and practice have not matched up to everyone's satisfaction. Maybe next we can discuss implementing transitive closures in portable SQL?

Back to arrays, the basic loss of flexibility is portability. If you modeled it traditionally with tables, your result would be portable to any RDBMS, including whatever indexes and constraints you made. By modeling it with arrays, you're limiting yourself to Postgres, and you're probably forcing yourself to use check constraints to ensure uniqueness within each array (if you care). The second loss of flexibility is that generic database access libraries probably don't support arrays natively. Hibernate, for example, does not "do" arrays, so you'll either be bypassing it altogether or converting them to strings and parsing them on the way out. These are the reasons I have off the top of my head; I wouldn't be shocked if there were others.

I don't think I'm going to see a third option besides arrays and normalization, but in this case, it would almost always be fine for the developer to err on the side of normalization. Are there uses for them? Sure. Just not many.
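
For comparison, the traditionally modeled version of the tags example would look something like this (names are illustrative, assuming a products table with an id column):

    -- one row per product/tag pair; portable, indexable, and uniqueness comes for free
    CREATE TABLE product_tags (
        product_id  int  REFERENCES products (id),
        tag         text,
        PRIMARY KEY (product_id, tag)
    );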


Arrays are actually pretty cool in PostgreSQL and keep getting cooler. Unnest() turns arrays into relations for example.

One of my primary uses for arrays is for passing data to/from the application. Data may not be stored in the db as an array, but it really makes passing complex data structures to/from stored procedures a lot easier.
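
A rough sketch of both uses -- unnest() to turn an array into rows, and an array parameter so the application can hand a whole list to a stored procedure in one call (the items table and function are hypothetical):

    -- unnest() turns an array into a set of rows
    SELECT unnest(ARRAY[1, 2, 3]) AS id;

    -- a hypothetical function taking an int[] so the caller can pass many ids at once
    CREATE FUNCTION items_by_ids(ids int[]) RETURNS SETOF items AS $$
        SELECT * FROM items WHERE id = ANY($1);
    $$ LANGUAGE sql;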


There is nothing that unique about inline support for other data structures. Many databases have done it in the past (e.g., Oracle) and do so today, especially in NoSQL land (e.g., MongoDB, Cassandra).


I think Jeff was principally referring to the relational dogma that is so prevalent in the culture of other SQL database systems, even though they might implement such types to get a check-box (or not: MSSQL doesn't support arrays, AFAIK). Clearly nested and record-tagged structures are popular in other database systems, but in the universe of SQL systems Postgres has always broken rank with most of its relational brethren (exception, due to similar parentage: Illustra and then Informix) when it came to data types -- for example, the long-standing inclusion of polygons, lines, and points.


There are probably some useful applications for this, but I am not really a fan of databases supporting complex data types. It's not because of a commitment to any particular normal form or theoretical construct. It is because on a project that involves multiple developers over a period of time, this sort of special functionality provides "surprises" that are not terribly pleasant.

- Nonstandard SQL is required

- Database specific functions are used

- Later developers can be confused by the use of a non-standard data type

- SQL Commenting does not happen much in practice in my experience

- Array data can be handled using existing SQL constructs, so it is never an absolute necessity

- Other languages are better equipped for handling the data types (in my fuzzy subjective assessment)

My experience is mostly with Oracle, which has been adding various data types for years (XML, Objects, Arrays, etc). I can't think of a specific case where their use proved to be a real benefit to a project... though the usual argument is improved performance.


Ah yes, the venerable array type. I remember trying to normalize applications written in VB6 and Access that used arrays with foreign keys instead of many to many relationships. Good times.


While I probably wouldn't store things in an array, it's useful to get data back out in array format sometimes.

    select array_agg(email_address), home_state from users group by home_state
Will give you a list of home states and all the email addresses that belong to that state.


Right. A case I've used them for is a recursive query that returns a set of rows, where each row is some end-point matching the query criteria; the rows can include arrays representing the list of nodes traversed to reach that end-point, or some notion of the path cost per hop.
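
A common shape for that kind of query, assuming hypothetical nodes(id) and edges(src, dst) tables -- the path array doubles as the cycle check:

    WITH RECURSIVE paths AS (
        SELECT id, ARRAY[id] AS path
        FROM nodes WHERE id = 1
        UNION ALL
        SELECT e.dst, p.path || e.dst
        FROM paths p JOIN edges e ON e.src = p.id
        WHERE NOT (e.dst = ANY(p.path))   -- don't revisit nodes already on the path
    )
    SELECT id, path FROM paths;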


Foreign keys as arrays is just wrong.

However, there are many cases where arrays are actually extremely useful, including the ability to be a useful intermediary type for stored procedure interfaces. For example, you can aggregate arrays of rows, and pass those into another stored procedure for processing, or you can use them for pulling info to/from your application. We do this extensively in PostgreSQL in part because DBD::Pg has excellent array support.

These are actually remarkably useful in PostgreSQL. Of course like any advanced feature it can be abused.
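
A sketch of the aggregate-and-pass pattern, assuming a hypothetical orders table with an amount column:

    -- aggregate whole rows into an array of composite values
    SELECT array_agg(orders) FROM orders WHERE customer_id = 42;

    -- a function that accepts such an array and unnests it back into rows
    CREATE FUNCTION order_total(batch orders[]) RETURNS numeric AS $$
        SELECT sum(o.amount) FROM unnest($1) AS o;
    $$ LANGUAGE sql;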


The MADlib add-on out of UC Berkeley for Postgres and Greenplum uses arrays for inputs into its algorithms. MADlib.net is worth checking out for some in-database analytics and machine learning goodness. It's not going to replace R or Python, but it has potential as part of your toolbelt.


Looks like the path to code/data obscurity to me. Why not have properly labeled fields for all the elements and save yourself the headache of remembering the fancy trick you implemented years ago?


I used Postgres arrays in a project a few years ago -- The university course catalog was so awful I had to create my own. There was a 7-element array of integers (one for each day of the week) where each element was a bitmap indicating if the class was active in that period.

To the DB driver, it was just a string -- { 0, 0, 0, 0, 0, 0, 0} -- so I split it into an array manually in client code.
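
If I'm reading that layout right, it would look roughly like this (the table and encoding details are my guess, not the original schema):

    -- one int per weekday; each int is a bitmask of the periods the class meets
    CREATE TABLE sections (
        id        serial PRIMARY KEY,
        schedule  int[]
    );

    -- does the class meet during period 2 (bit 1) on Monday (element 1; arrays are 1-based)?
    SELECT id FROM sections WHERE (schedule[1] & 2) <> 0;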


Excellent article on Arrays. I wish I had known about these before; there have been so many times I should have used this instead of serializing/deserializing JSON myself via TextFields :(


PostgreSQL 9.2 supports JSON, btw!

You can store JSON, or, even better, you can write SQL that returns an arbitrary JSON structure.

So you can have one SQL query return a nested array of hashes of arrays of hashes... handy if you need to retrieve a lot of different data at once that doesn't fit into a neat set of rows.
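
For example, with the 9.2-era row_to_json/array_to_json functions (reusing the hypothetical users table from the array_agg example above):

    -- each state with its users nested as an array of JSON objects
    SELECT home_state,
           array_to_json(array_agg(row_to_json(u))) AS users
    FROM users u
    GROUP BY home_state;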


What we need now is a function that will take an arbitrary JSON element and return a data structure associated with it. This shouldn't be too hard using plv8js.....

This could then be used as an input format for object-relational modelling.


The more you put in the database black box, the more scalability problems you will encounter in the future. It's more economical to scale out than to buy bigger database servers.

Please just don't do it.


Happily, Postgres keeps improving performance with each release. According to benchmarks [1], the upcoming version 9.2 will scale up to 64 cores for read-heavy workloads. Unless your application is going to grow really fast -- keeping in mind that hardware capabilities will continue to improve each year -- worrying about horizontal scaling is just premature optimization. Just upgrade your database server once a year and reap the benefits of Moore's law.

[1] http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-abo...


I love your optimism. The real world doesn't work like that. The real world punishes you for every shitty feature you pick and every bad chunk of code.

Our application is 15 years old, and we're on a 64-core machine with 768GB of RAM and 35TB of disk.

We're running at 80% capacity.

Where do we go from here? Yes, we rewrite and scale out for the measly cost of £450k. That cost would have been avoided with the appropriate due diligence. That is not a cost anyone wants to swallow.

Then again, I don't work on disposable CRUD applications...


"The more you put in the database black box, the more scalability problems you will encounter in the future."

I don't see an obvious relationship between using arrays to store product tags and scalability.

What method of solving this problem are you suggesting, and what are the scalability characteristics of that solution?

Databases don't go out of their way to hurt scalability. Databases are shared state, and scaling that is just hard. Unless the problem is easy, in which case it's easy no matter what you use.


Looking at the way arrays are implemented, they can readily turn operations into O(M*N) rather than O(N). Even the PgSQL documentation suggests cases that will hurt you.

I would suggest sticking with the relational model, as there are no cases where arrays are valid and it isn't, and the optimiser is likely to come up with a better solution.

Relational databases (well all databases) have convenient tools which in the short term look good, but in the long term will hurt you.

Careful testing and feature selection is the alternative I am suggesting (from 20 years of experience with RDBMS platforms).


"can readily turn operations"

Which operations, specifically? We have the product tags example to work from, so it would help to be more concrete.

"into O(M*N) rather than O(N)"

I assume here that your target is "scalability to higher numbers of tags for a given product"? If so, please state that explicitly. Again, M and N aren't necessary, because we have an example to work from.

Are the operations still susceptible to worse scalability on this axis if you employ a GIN index?

"PgSQL documentation suggests cases that will hurt you."

Please elaborate.

"I would suggest that the relational model is maintained"

I don't see how your solution is relational and the other one is not. Can you please explain? How is using an array different from using, say, a string, which is essentially an array of characters? How is it different from using a date, the components of which you can access (e.g. grouping by the month)?

"the optimiser is likely to come up with a better solution."

Please elaborate.


Ok here we go:

A table is a list of rows. Accessing a row is O(N) assuming a perfect B-tree. Accessing an element in an array is O(M), therefore accessing an element of an array in a row is O(N*M), as you have to materialize the entire page that the array is in from disk. That array could be larger than a page, which leads to more problems.

GIN is a possibility, but the array elements are ordered, which is overhead (they are a list, not a relational set); so there is a complication if you want to reference by index, which is the primary point of arrays. If you don't -- if the elements aren't ordered -- you should use a set, which means you should probably use a table and a join anyway.

Fundamentally membership is a set relation, not a list relation. The order of the tags is not important.

Pain = PgSQL documentation section 8.14.3, which references array indices directly in a WHERE clause (roughly the pattern sketched below).

A string - I wouldn't do string manipulation anyway in a where clause.

A date - that's a number which you can apply relops to easily.

The optimiser will most likely be better at retrieving and caching data based on a join condition versus column contents.
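
The documentation case in question looks roughly like this (sal_emp/pay_by_quarter is the manual's example schema, if I recall correctly):

    -- checking each position by hand; the planner can't do much with this
    SELECT * FROM sal_emp
    WHERE pay_by_quarter[1] = 10000
       OR pay_by_quarter[2] = 10000
       OR pay_by_quarter[3] = 10000
       OR pay_by_quarter[4] = 10000;

    -- the manual's suggested alternative
    SELECT * FROM sal_emp WHERE 10000 = ANY (pay_by_quarter);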


"To access an element in an array"

I guess that's the thing: why would you want to access array elements? What business problem is that solving?

"there is a condition there if you want to reference by index which is the primary point of arrays"

In this example, he's using an array more like a set. I don't think he intends for a tag at index 17 to mean something different than a tag of the same value at index 3.

I agree that postgresql should probably have a good "set" datatype, but at this time it does not and arrays are a reasonable way to implement sets in a lot of cases.

For instance, GIN allows fast searches on queries like "find the arrays that contain element 'X'". That's more like a set operation.
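
Something along these lines, assuming a tags text[] column on a products table:

    -- a GIN index makes "which rows contain tag X" lookups fast
    CREATE INDEX products_tags_idx ON products USING GIN (tags);

    -- answered from the index rather than by scanning every array
    SELECT name FROM products WHERE tags @> ARRAY['featured'];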

"The optimiser will most likely be better at retrieving and caching data based on a join condition versus column contents."

There are probably some cases where it's better and some where it's worse. Postgres does keep track of stats for array elements in addition to the arrays themselves, so it's not obvious to me that using a separate table would improve matters.


It really depends on the case at hand; sometimes it's more practical to scale up, especially if no indefinite growth is anticipated.

http://www.codinghorror.com/blog/2009/06/scaling-up-vs-scali...


That's hardly a great reference, especially when he immediately shot himself in the foot by not checking vendor limits...

http://www.codinghorror.com/blog/2009/07/oh-you-wanted-aweso...


The general point still stands - scaling up is often a preferred solution, and life proves it every day. Not many companies enjoy Facebook-style growth, and most aren't viral at all.



