Using PostgreSQL Arrays The Right Way (heapanalytics.com)
80 points by drob on Feb 25, 2014 | hide | past | favorite | 38 comments


While this is really cool, can anyone tell me if it has become any easier to write stored procedures/functions?

Last time I tried writing stored procedures I went into a deep depression from the lack of a proper IDE/editor/debugger/tester/anything.

Is there any process out there besides time-consuming trial and error? Or does everyone who writes these know them by heart, and should I just stick to programming languages?


Has anyone tried out python in postgres? (http://www.postgresql.org/docs/9.0/static/plpython.html). In the past, I've tried to avoid stored procs for the reasons mentioned above, but have not yet had the chance to try out python and would be interested in hearing about people's experiences with it.
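For anyone who hasn't seen it, a minimal PL/Python function looks like the sketch below. (plpythonu is the untrusted variant; installing it requires superuser rights, and as noted downthread some managed hosts don't offer it. The function name is made up for illustration.)

```sql
-- Install the untrusted Python language (9.1+ extension syntax):
CREATE EXTENSION IF NOT EXISTS plpythonu;

-- The function body is ordinary Python; its return value becomes
-- the SQL return value.
CREATE FUNCTION pymax(a integer, b integer)
RETURNS integer AS $$
  return max(a, b)
$$ LANGUAGE plpythonu;

-- SELECT pymax(3, 7);  -- returns 7
```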


Pretty good, you can even access local python libraries for direct use in your db functions. Only negative is that you have to re-compile your functions to update the library references.


I have not spent any time with it, but https://github.com/petere/plpydbapi looks promising to me.


I have, it's great to be able to write DB functions in Python. Only downside is that Amazon RDS does not currently support this extension!


I've been writing PostgreSQL stored procedures a.k.a functions for almost 10 years. They're great, and have only gotten better. I regularly switch between SQL, plpgsql, and python depending on what's best for the job. Also, plpgsql allows for some really nice functionality with relative ease.

Honestly, I don't understand how anyone can do more than a basic query in the command line for PostgreSQL. I've been using EMS's PostgreSQL Manager (www.sqlmanager.net) for years. It has auto-complete (even from local aliases), named and numbered parameters, base templates for your BEGIN and END, and lots of other goodies. I could work without it, but it would be much more time consuming. I do, however, practice most things in the SQL Editor first. Though it has good built-in analysis, sometimes it's nice to just try different things before committing them to a function.

Edit: Honestly, I wish they had better screenshots, but this should give you an idea:

Functions: http://www.sqlmanager.net/en/products/postgresql/manager/scr...

Debugging: http://www.sqlmanager.net/en/products/postgresql/manager/scr...

Triggers: http://www.sqlmanager.net/products/postgresql/manager/screen...

Query Builder: (used this much more when I was younger) http://www.sqlmanager.net/en/products/postgresql/manager/scr...

Full list: (though don't know how current) http://www.sqlmanager.net/en/products/postgresql/manager/scr...


psql also has auto-complete. honestly i really like psql. have a look at its manpage for how much you can customize it to your needs!


IMO psql is nice for basic things, but when you're writing hundreds of functions it's nice to be able to navigate and manage/edit things a bit more quickly.


when you are writing hundreds of functions, do you do that live? i use text files and version control. psql is for tinkering around with the db when more disciplined approaches seem like overkill.


Just this week I had my own go at plpgsql again a few years after my last attempts. I have come away with the same conclusions I did last time "oh wow, this is really cool and fast and great; oh bummer, this is taking way too many of my own cycles to get something working".

The pg docs unfortunately don't go much beyond selecting some data. Trying to write any real logic left me struggling (in my case, I needed a hash table that could store key-value pairs of numbers).

From my own experience a few years ago, however, go look at the stored procedures in PostGIS. There are many, many of them methodically written and you could probably learn most of what there is to know from detailed inspection.


Augh, I needed a hashmap in plpgsql last week! It's the worst. Best I could come up with was an awful hack in which I left stuff in a table with columns (key, val, invocation_code), where the third is a UUID specific to the invocation of the function. (Disgusting! I'm not proud of this.)

I'd pay (one upvote) for a blog post with a better way to do this. If one doesn't exist, this might call for a postgres extension.


The extension you want is hstore. It allows you to create variables and columns of type hstore which is a (String -> String) hashmap. If you actually needed something more like (String -> Integer) or (String -> Decimal), then you'll just need to cast any values on their way out of the hstore.

http://www.postgresql.org/docs/9.1/static/hstore.html
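To make the casting point concrete, here's a hypothetical sketch (function and variable names invented) of using an hstore variable as a local String -> Integer map inside plpgsql, casting the counts to int on the way in and text on the way out:

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Count occurrences of each value in an array, using an hstore
-- variable as the hashmap.
CREATE OR REPLACE FUNCTION count_values(vals text[])
RETURNS hstore AS $$
DECLARE
  counts hstore := ''::hstore;
  v text;
BEGIN
  FOREACH v IN ARRAY vals LOOP
    -- counts -> v is NULL for unseen keys; || overwrites the key on merge
    counts := counts || hstore(v, (coalesce((counts -> v)::int, 0) + 1)::text);
  END LOOP;
  RETURN counts;
END;
$$ LANGUAGE plpgsql;

-- SELECT count_values(ARRAY['a','b','a']);  -- "a"=>"2", "b"=>"1"
```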


This was actually the last thing I was looking at before throwing in the towel and moving back to the app server in the interest of getting things done. I'll have to give it another go some time, thanks!


I guess we could do it with hstore. I wonder what sort of performance I'll get on it for a hashmap workload. Hmm...


Compared with adding multiple rows to a table to simulate one local variable? I expect readability and performance will both be greatly improved, but... it's hard to offer a good recommendation without knowing what sort of calculation you're trying to perform.


I've found performance pretty decent when using hstore with indexes. You should read about what's coming in hstore 2, also.


Do you have a good description of hstore 2? A google search found a few references that there is an hstore 2, and that it might be related to PG 9.4, and that it might allow the values to be non-string, and it might allow nesting... nothing very clear.


I've not gotten a chance to play around with it yet, but there's plv8x (https://github.com/clkao/plv8x), for managing plv8 functions.


There are plenty of tools (including editor support, debuggers, testing frameworks, call tree tracers) around stored procedures but I cannot vouch for their quality since I do not use them myself. I mostly do my coding by heart and my testing in the applications I write.


We heavily rely on postgres as well. In the past we tried to do the entire backend in postgres, as we believed it could be the best solution, but we had a terrifically hard time maintaining stored procs, etc. We still use them, but a lot less than we otherwise would, and we wound up building an ORM and moving most logic to PHP.

I would be really interested in how you maintain your database code if you are up for another write up or have some time to chat.


For now, the following works for us:

- For plpgsql functions that are required by app queries, we just roll them out when we write them / when they change (via ansible).

- For plpgsql functions used in jobs, the job just reloads the relevant plpgsql functions on the relevant DBs before they start doing anything. (It's a little wasteful, but not in a way that matters for now.)

- The UDFs we've written in C don't change too often, but we deploy them manually when they do.

What are some of the headaches you've had in managing stored procs? How often is your app code changing / requiring new ones?


One headache I had in the past was with a highly-available app with stored procs. Parallel app server restarts were rare; they were instead done in serial. (This did make certain migrations a pain.) Outdated app servers would continue calling plpgsql functions in the outdated way until they received new code.

We solved it with a small wrapper around the calls to plpgsql functions — we set up a build process to generate unique names that were called by version. During deploy, we'd have two or more versions of the 'same' function running in parallel. The last deploy step dropped the now-outdated functions.

It worked relatively well.
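The build-generated names aren't shown in the thread; hypothetically, the scheme might look something like this (function names and signatures invented):

```sql
-- Version-suffixed function names. During a rolling deploy both
-- versions exist side by side; app servers still running the old
-- release keep calling the old name until they're restarted.
CREATE OR REPLACE FUNCTION record_event_v42(p_user bigint, p_event text)
RETURNS void AS $$
BEGIN
  INSERT INTO events (user_id, event) VALUES (p_user, p_event);
END;
$$ LANGUAGE plpgsql;

-- Final deploy step, once no v41 app servers remain:
DROP FUNCTION IF EXISTS record_event_v41(bigint, text);
```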


Have you thought about using schemas and the search path for this? That is what Zalando does to deploy stored procedures without risking any downtime.


If I had to do that again, I would. It's been a while, but as I recall from that project the database adapter had shoddy support for search paths.


This is not really an answer to your question, but it sounds like you have PHP developers trying to become part time database developers. I don't mean this negatively, just that server-side developers are different from database devs which are different from UI/front-end devs.

Once you get beyond normal queries and enter the realm of real database development you likely want someone who is a genuine database developer.

I've worked with several database developers, plus my wife is a database developer. They are a different breed, and they really develop applications in the data layer.

I can say from experience that once you add a database developer you'll never look at the data layer the same. Often, there is a substantial amount of work that can be performed better by the database (whether it's Oracle, SQL Server, Postgres or whatever) since they have 30+ years of tuning for specific data operations.


No, I don't have PHP developers or database developers; I have hackers, and they are trained to be great at both database development and PHP development. I don't mean this negatively, but you shouldn't hire people who can't figure out a new technology.


If hackers 'trained to be skilled at database development' and good at figuring out new things couldn't make Postgres work for you, why do you believe it "could be the best solution"?


I have seen some companies using schemas and the search path to deploy stored procedures. In every deploy a new schema is created for the deployed version and all stored procedures are added to it from the repository. The new version of the application will use the new schema while the old version still uses the old schema. When the old schema is no longer needed it can be dropped with all procedures.

I have seen other ways to deploy stored procedures, but these often require being very careful about function APIs and delta scripts.
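A sketch of the schema-per-deploy pattern described above (schema, function, and table names are all hypothetical):

```sql
-- Each deploy loads every stored procedure from the repository into
-- a fresh schema named after the release.
CREATE SCHEMA api_v42;

CREATE FUNCTION api_v42.get_user_name(p_id bigint)
RETURNS text AS $$
  SELECT name FROM users WHERE id = p_id;
$$ LANGUAGE sql;

-- App servers running the new release resolve unqualified function
-- names against the new schema:
SET search_path TO api_v42, public;

-- Once no connections use the old release any more:
DROP SCHEMA api_v41 CASCADE;
```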


Why use an array instead of a separate table? Are you trying to avoid a JOIN?


Yes. Our schema is fully denormalized, which is particularly important for performance on funnels. (See heapanalytics.com/features/funnels)

In particular, to compute where a user drops off in a funnel, I need to scan one array from left to right, and I don't need to do any joins. This shards very well, since all of a user's data lives on one shard, and most of the queries are aggregations, which are simple to reassemble from subqueries.
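Heap's actual implementation isn't shown in the thread; a hypothetical plpgsql sketch of that left-to-right scan, with no joins, might look like:

```sql
-- Walk a user's event array once, left to right, and return how many
-- funnel steps they completed in order.
CREATE OR REPLACE FUNCTION funnel_depth(events text[], steps text[])
RETURNS int AS $$
DECLARE
  depth int := 0;
  e text;
BEGIN
  FOREACH e IN ARRAY events LOOP
    EXIT WHEN depth >= coalesce(array_length(steps, 1), 0);
    IF e = steps[depth + 1] THEN
      depth := depth + 1;
    END IF;
  END LOOP;
  RETURN depth;  -- 0 means the user never entered the funnel
END;
$$ LANGUAGE plpgsql;
```

Since all of a user's events live in one array on one shard, each shard can compute depths locally and the aggregation is reassembled from subqueries.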


So why are you using an RDBMS?


> So why are you using an RDBMS?

Because even a shitty RDBMS is more robust, more secure, and faster than any NoSQL wankery for this kind of use case.


How many times do we see the opposite question getting asked?

Postgres is robust, feature-filled, and scalable. You don't have to want all three to extract value from it.


PostgreSQL has gradually and intelligently been incorporating NoSQL features, e.g. hstore, hstore2, json, jsonb. There are powerful features on top of them, like indexing.

There are even cases where performance has been found to be better than e.g. Mongo's: http://obartunov.livejournal.com/175235.html

There is a lot of momentum behind the improvements/additions. We will be seeing much more NoSQL in 9.4, 9.5, and beyond.


> Our schema is fully denormalized, which is particularly important for performance

And yet appending a single event (you dedupe for each event, right?) takes half a second. That's an eternity!


Query performance is critical. Insert perf is a comparatively minor concern.

Even so, there are a few factors to consider:

- Half second dedupes are for users with 100k+ events, which is <<1% of them.

- We batch events for ~5s before adding them to the cluster, so we aren't deduping for every event -- only once per ~5s of events per user.

If this becomes an issue, we can remove the deduping from normal operation and only call it when we're backfilling / updating events. Even so, we still need this function to exist, and the 100x performance improvement is very helpful.


Does anyone know offhand whether fixed length values, e.g., bigint, have this problem?


No, this isn't an issue for fixed length types, although you're still better off using unnest instead of explicitly indexing into them.
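The quadratic behavior being discussed comes from variable-length element types like text, where evaluating arr[i] has to walk the array from the start to find element i. A sketch of both access patterns (table and column names hypothetical):

```sql
-- O(n^2) for text[]: each events[i] lookup re-walks the array prefix.
SELECT id, events[i] AS event
FROM (SELECT id, events, generate_subscripts(events, 1) AS i
        FROM users) t;

-- O(n): unnest streams through each array once, even for text[].
SELECT id, unnest(events) AS event
FROM users;
```

For fixed-length types like bigint, element i sits at a fixed offset, so indexing is O(1) either way; unnest is still the tidier idiom.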



