

Tagging in Rails 4 using PostgreSQL arrays - ricny046
http://rny.io/rails/postgresql/2013/07/28/tagging-in-rails-4-using-postgresql-arrays.html

======
deno
Using arrays for relations, like tags, is a bad idea. If you try to do
anything more complex (joins) later with them you’ll find yourself creating
temporary tables with UNNEST and your performance will plummet. This is
especially true since materialized views are still WIP.

Arrays are best used for logical units of data that you’d otherwise serialize
or put in several columns. They are for avoiding having to create additional
tables for non‐relational data not the other way around.

Think of them as complementary DIY solution to PostgreSQL’s geometric types,
like Point or Box[1].

From PostgreSQL manual[2]:

> Tip: Arrays are not sets; searching for specific array elements can be a
> sign of database misdesign. Consider using a separate table with a row for
> each item that would be an array element. This will be easier to search, and
> is likely to scale better for a large number of elements.

[1] [http://www.postgresql.org/docs/9.2/static/datatype-
geometric...](http://www.postgresql.org/docs/9.2/static/datatype-
geometric.html)

[2]
[http://www.postgresql.org/docs/9.2/static/arrays.html](http://www.postgresql.org/docs/9.2/static/arrays.html)

~~~
saurik
And yet, PostgreSQL ships with two index solutions (the ones mentioned in this
article) that allows you to index them as if they were sets, supporting highly
efficient queries of the form "give me all entries that have the four tags on
them". This really depends on your data model and what kinds of queries you
want to perform. Even that tip talks about large numbers of elements: you
don't usually have large numbers of tags.

~~~
deno
Sure you might get away with it, that’s not the point. The point is tags are a
type of relational data and there are already better (and proper) ways to
manage relational data in a relational database.

~~~
saurik
PostgreSQL is an object/relational database, and it is not clear that pulling
this out to a relation is "better". This isn't a case of denormalization, as
the data isn't being duplicated, and the database supports indexing the
content correctly, even having appropriate primitives for manipulating it.

Let's put it this way: the set of tags an object has on it represent its
location in an N-dimensional (where N is the number of possible tags)
coordinate system, much like a longitude/latitude pair represent a location in
a 2-dimensional coordinate system. Tags are just more discretized.

If you are thinking about the search space of tags like this, wanting to ask
questions like "find me documents close to this one in tag-space", then it is
not just less efficient but _semantically incorrect_ to pull this information
out into a separate table: this isn't some NoSQL BS, this is math.

To put this in perspective with the longitude/latitude scenario, this would be
like storing a separate table that had "post, axis, value", with post as a
foreign key to the post being located, axis being either "longitude" or
"latitude", and value being the floating point coordinate.

Many, including myself, would say separating longitude and latitude like this
is a misunderstanding of the relational data model; its the kind of thing an
ORM would do to our data, and ORMs are not known to think much about semantics
in the database: they are just mapping concepts blindly.

Pulling tags out into a separate table is doing the same thing, conceptually,
and causes the same issues asking questions like "similar document" as
separating longitude and latitude causes "similar location". You can't index
that location as a single "thing" anymore, as it isn't one.

However, if the questions you are asking are more "give me the list of posts
tagged with this specific tag", and the tagging event is a "thing" that maybe
even has who performed the tagging, when the tagging happened, then you'd want
a tagging relation. It's all about semantics and use cases.

~~~
deno
In this example longitude/latitude tuple is a single logical unit, and a
perfect example for using arrays, though of course PostgreSQL already has this
covered. However this isn’t comparable; it’s a single location, single column,
not a list of locations. A list of locations gives you choice between using
multi‐dimensional array ([[x,y], [x1, y1]…]) or a relational design, and we’re
back at square one.

I think the crucial element here is whether the set is bounded. A set of
coordinates is bounded, a set of tags usually is not.

I know I made this point about semantics in my last post, but I’m actually
more concerned about the practical aspects of working with arrays used in
this, in my opinion, relational (unbounded) case. It may be so that arrays in
PostgreSQL will become so powerful as to provide complete alternative for
forming relational models… however that wasn’t my experience so far. More
complex joins require you to go through UNNEST() and essentially normalize the
arrays into temporary tables, which makes for terrible performance.

The relational design has also other advantages, a composite primary key has
unique guarantees, something that’s not yet possible for arrays, and separate
table always gives you the flexibility of including additional metadata as
your needs evolve.

For me the question really is ‘what are the practical advantages for using
arrays over a relational design,’ and not the other way around. After all,
PostgreSQL is still mostly a relational database. And of course, as others
have pointed out, there are some advantages, but most of them would be
considered (micro)optimizations, and I’d rather leave those to database
developers.

~~~
saurik
Operations you tend to perform on tags: "find me posts that have all of these
tags", "find me posts that have similar tags to this one", "order the results
of any of these queries by properties of the post". Something you usually do
not do with tags: "give me a list of every tag that the system has ever seen
used on a post". Tags normally don't themselves have properties. The event of
a tagging usually does not (but sometimes does) have properties either. I am
not at all certain why you would find yourself needing to UNNEST: there's
nothing you'd do with a tag where that would be interesting or valuable.

In my previous comments, I offered multiple "practical advantages" of using
arrays for this case over relations; you are choosing to view them as
"microoptimizations", but I'd say the exact opposite: every operation one
normally does with tags, when used with arrays, is a single operator in a
where clause. Even the simplest case, "show me posts with these tags" is at
least slightly harder with the relation. In general, the clearest code with
the right semantics tends to also be the one that gives the database the most
room to optimize: its the same here; you are just muddling up the
N-dimensionality of tags by moving them into a relation.

Your attempt to compare a list of locations doesn't make much sense to me, as
a list of locations doesn't seem to fit well with the question "give me a list
of posts that have similar sets of locations to this other sets of locations";
if you can explain the semantics of that use case more, I might be able to
appreciate it. However, right now, I think you are still just caught up
looking at tags at the wrong scale: a set of locations is more similar to a
set of tag clouds, not a single tag cloud; a tag cloud _is_ a single location
in an N-dimensional space, and all of the operations developers tend to
perform on that data type are easier to think about, easier to optimize, and
even easier to type the SQL for with PostgreSQL, than thinking about them as
discrete data items.

What separates PostgreSQL from other database systems in this regard is that
they actually have index implementations that can make this obvious
translation of the semantics of tags fast: normally, you are forced to do the
relation, and you end up in the hell of working with relations for things
where they make no sense at all. As an example: "give me all the posts from
last friday, and all of the tags on each"; this is the exact opposite of your
UNNEST scenario: you end up having to rebuild the records of the posts here by
doing a join and an array_agg. It would be ok if you bought something, but you
didn't, and the only reason it even seems to be "more relational" is due to a
backwards understanding of what a tag cloud is (like, to be clear: I really
don't like the denormalization-is-good crowd, and I think NoSQL is generally
the wrong solution to any problem, but I maintain that splitting a tag cloud
into a relation is like splitting a basic data type like a long/lat, or
attempting to do an ORM columns->rows mapping, both of which misapply
relations; its like you are reading what I said as if I'm talking about
long/lat being changed into an array, and totally ignoring the semantics of a
long/lat as a location in a two-dimensional space... the fact that "PostgreSQL
already has yo covered" for something like long/lat precisely demonstrates
that people wouldn't even consider breaking those apart).

Can you provide me an example of your problem using arrays here? You seem to
just be asserting the same point that they are somehow harder over and over
again. The only such example I'm seeing in this post is "uniqueness
guarantee", which is a property of a single array and thereby trivially
described as a check constraint on the column. This kind of set-oriented use
case, FWIW, is actually precisely why GiST and GIN were created, and why they
support the intarray contrib module; and this kind of facet-oriented search is
one of the reasons why PostgreSQL got so many great contributors via PostGIS.

~~~
gbog
I think you pointed out the main issue with arrays. If I follow your reasoning
and suppose we will not want to know who is tagging what and when because we
believe we just want to know the bare list of tags attached to an item.

But then the app complexifies, we need more data, and we would like to have
ansers to questions such as "is tagging different in the morning or in the
evening?", "are there new trends in tags or new tags since last year?", or we
even simply want to remove the tags created by some users because we know they
try to cheat the system or spam it.

Then with your "simple" solution I think we are stuck and will need to
migrate, which is a pain. Moreover, all past data is unsuable.

If, on the opposite, if we just do a "who tags what when" table, we are on
much safer ground. We get a scalable event table that can be indexed (and
sharded) by timestamp. We can undo any operation (which should be required for
any action but email sending). The only drawbacks are 1) we need to aggregate
and cache this data to have a list of tags for each item, and 2) we might have
stored some data we might not need in the future.

I always feel the saved data granularity should be one level finer than the
data granularity used in the business model.

~~~
saurik
> we even simply want to remove the tags created by some users because we know
> they try to cheat the system or spam it

I'm not certain we are even talking about the same use case for "tags" here.
If different users can tag the same post you start having to ask other
interesting questions regarding permissions and ownership. The model I'm
looking at is more blogger or delicious: the posts by users are distinct, and
are owned on data records that are owned by those users. In the delicious
case, we are all tagging the same URL eventually, but each user is really
submitting a post for a URL with its own tag cloud. You cannot add tags to my
post: you have to make your own post that references the same URL and put tags
on that. The semantics of groups of people tagging things sounds awkward for
arrays (and the one system I've built that was similar to that, where tags
actually have ownership and permissions on their use even, did not go down
this route).

=====

(I wrote the following _before_ noticing the incompatibility of your "a user
somehow tagged something" case with my mental model of what we meant by tags;
I still think it might be valuable, though.) I think that you are focussing on
a narrow use of tags where all they can do is "list entries tagged by"; if
that's all you are doing, then go ahead and use the repation.

However, please understand that the app didn't "complexify" as you added those
data items, and the data model didn't somehow improve to handle them: you
_lost_ the ability to do queries of the form "find documents similar to this
one" without doing heroics. This is a trivial single select order by operator
query with a cloud, and this is nearly impossible to represent using a
separate relation.

Now, think about the new queries you just wanted to perform: they were on
properties not of present-tense taggings, but of past-tense ones. If someone
adds a tag in the morning, and later deletes it, your business intelligence
query still cares that they added that tag in the morning. This means you
can't ever delete records, and end up adding "deleted" keys. This, in turn,
makes your unique indexes require predicates, as all of your queries. It also
means your "hot" user data includes "dead" information it has to sift through.

Again, you're going to try to tell me this is a "micro optimization", but I'm
going to say the exact opposite: for one ninche use case that end users don't
even interact with, you made your app harder to implement (as the relation
ways of accessing the tags is more awkward to code), less functional and
slower (as now some queries are impossible at worst, often now unindexable
when possible, and even at the best being bogged down by skipping data that
serves a different purpose). This is the same argument against micro-
optimization.

Instead, I urge you to consider business intelligence issues as fundamentally
different than your functionality issues. The way to solve this problem is to
store events I how tags are use in a separate database, possibly a totally
different kind of database, that is setup to store historical information of
every tagging event that ever occurred (or, even, starts compressing this
information into time windows). There is a reason why BI-specific database
solutions exist, or why Amazon offers both red shift and DynamoDB: they are
just different things.

------
calebio
I think one problem with this is that it doesn't behave like a set which I
_believe_ acts_as_taggable_on on does. It doesn't allow a user to tag an
object with the same tag more than once.

It seems that the Postgres array method would (without other backend code to
prevent it) immediately allow a user to tag with the same tag over and over.

So for users who are more comfortable using an out of the box solution (even
if they are using Postgres) they may have better luck with
acts_as_taggable_on.

Either way, this is pretty cool!

------
film42
I just approached a problem with tagging about a week ago and wanted to use
postgres arrays, but the problem was, I felt like adding proprietary
components to ActiveRecord killed the flexibility of my app. Meaning, I'm 99%
database agnostic and the 1% keeps me from migrating to some xyz cloud sql
solution if I needed to scale up quickly and not get hosed by pricing schemes
of certain vendors.

Does anyone else have similar thoughts?

~~~
fdr
I don't think worrying about database portability often makes much sense for
projects with beyond the simplest access patterns and projects that one does
not ship to a client, where one anticipates needing to support many possible
databases that are supported on-site. Oftentimes, using features supported
only by one or a few database management implementations can make things
substantially easier.

Ports are often difficult and risky without substantial continuous integration
all along because of other quirks in the implementation regardless, both of
any ORM layer and of the DBMS.

One advantage of arrays can be performance: not having to locate several rows
in another relation can be a time-saver; in this case it can act like a nicer
denormalization. But, as array elements are not quite first class in SQL (as
the manual notes) it doesn't work quite as smoothly if one plans to enrich
their tags with a lot of meta-information or other relationships.

In any case, I have used this strategy on an analytic-oriented system and it
worked great: the queries were simple to read, performed well, and the
representation of the row was convenient to use in client programs. I've also
used hstore to convenient effect on more OLTP-oriented systems although it too
is a Postgres-only (for now) data type extension.

------
slaxman
I am a rails developer and recently shifted to using postgres. it's stuff like
this that I love about the rails postgres combination. It makes writing code
so easy. agreed that it may do 80% of functionality of current gems but with a
little tweak it can achieve the same while keeping your code minimal and
readable.

another example is the hstore feature in postgres. it makes storage of json
type data in a rdbms a breeze.

------
lbarrow
What's the advantage of doing this instead of just making tags a separate
table with a many-to-many relationship with documents?

~~~
plq
Many-to-many relationships come with their own overhead -- you use additional
resources for foreign keys, indexes, etc.

If your data is seldom updated (or not that big), not used for database
operations, and having it all in memory is fine, it can be better to use
native arrays.

~~~
deno
It’s hard to tell how big of an advantage this is, if any, without some actual
data.

------
instakill
If anyone has problems with updated data in the arrays not being persisting, I
actually wrote a post on that today
[http://paweljaniak.co.za/2013/07/28/rails-4-and-postgres-
arr...](http://paweljaniak.co.za/2013/07/28/rails-4-and-postgres-arrays/)

