Hacker News new | comments | show | ask | jobs | submit login
Ways to paginate in Postgres (2016) (citusdata.com)
186 points by ligistic 9 days ago | hide | past | web | 46 comments | favorite





The computer programming field is plagued by a tendency that needs a name. What do you call it when the blogosphere pressures you to reject all these time-honored, perfectly good techniques, because of some edge case in a niche field that you will never have to deal with? Like the programmer ensconced in some corporation writing a business app that will only ever be used by 100 people. He rejects using just one server, because that couldn't possibly be enough, you must at least separate your web server from your database. He uses NoSQL instead of SQL because SQL doesn't scale. He uses React and a single-page application, because old-fashioned HTML and jQuery will get way out of hand --- even though he's just printing some tables and maybe some graphs, based on choices made in a form. And so his solution is worse than all of the problems he's trying to avoid combined. And none of them were going to come to pass anyway. Okay, this was an extreme example.

However, I see the same tendency at work in the demonization of poor old LIMIT and OFFSET. Let me talk about OFFSET first. This article talks about its inefficiency. Well, it may be inefficient, but every page I've generated with OFFSET still comes back in a split second. So coding around it seems to me like a premature optimization. Now let's talk about LIMIT. It's limited in reliability, because what if someone inserts while you're paging along? The item at the end of page 3 is now at the top of page 4! More insidiously, what if someone deletes? Then the top result on page 4 becomes the last result on page 3, and you never see it.

That scenario troubles my perfectionistic mindset, but I have to say in all my apps it is not the end of the world. I run into this all the time anyway on other people's websites, major and minor: whether I'm paging through search results on Google and Amazon or the latest posts on some web forum and even Hacker News. It's no big deal. At least, for my purposes it doesn't seem worth doing one of the other methods I read about here.

Now if you are programming a self-driving car or automating the mixture of medicines, maybe be careful where you use LIMIT and OFFSET.


The article isn't "demonizing" LIMIT/OFFSET. It's very clear about the use cases:

> When to Use: Limit-offset

> Applications with restricted pagination depth and tolerant of result inconsistencies.

I don't agree with your conclusion that result inconsistencies are "not a big deal" for most cases. It's a deliberate trade off. For example, I think it is annoying on Hacker News, but I understand why they chose to show inconsistent results (If I recall correctly, HN used to have consistent pagination using something similar to cursors, but they threw it out because it caused too much server load and it was annoying when sessions expired)

On the other hand, when my accountant goes through my expenses one by one to check if they have been booked correctly, I don't want him to miss lines due to inconsistent pagination.

Just because a technique is "time-honored", it doesn't mean it's "perfectly good" in every situation, or even in most situations. You always need to evaluate your techniques, no matter how common they are, to see if they work for your particular use case.


The article says it is "most perilous." Another comment mentioned another blog, http://use-the-index-luke.com/no-offset, says to never use it.

This is the syndrome I was getting at. An article attacks a shortcoming of an established way of doing things and then glosses over the deficiencies of its own alternative. The deficiencies of the original solution aren't a problem for most people, but the deficiencies of the new one are.


This post doesn't reject limit, offset -- it's just that you shouldn't use it if your data is very deep or if you need hard consistency. It's nice that you live in a world where this stuff doesn't matter though. Congrats on not having these requirements.

> That scenario troubles my perfectionistic mindset, but I have to say in all my apps it is not the end of the world. I run into this all the time anyway on other people's websites, major and minor: whether I'm paging through search results on Google and Amazon or the latest posts on some web forum and even Hacker News. It's no big deal. At least, for my purposes it doesn't seem worth doing one of the other methods I read about here.

While I agree with nearly 100% of what you're saying, I have run into situations where an application requires an infinite-scroll style of pagination, where the typical LIMIT/OFFSET approach can be problematic: users end up seeing duplicates while scrolling due to the ever-shifting boundaries, which seems to bother them more than if it was a completely separate page that they reloaded and saw the same duplicates.

As such, in any situation where we encounter an infinite-scroll pagination setup (which is quite common on mobile applications) we've implemented keyset-based pagination. It requires a modicum of additional thought to ensure it is correct, especially if you have very… interesting sort conditions, but ends up being quite bulletproof once implemented.


I think the term you're looking for is cargo cult programming.

Sure, applying keyset pagination prematurely can be a form of cargo cult programming, but OTOH I'd rather work with somebody who is aware of the different pagination options available. Hopefully that same person can also choses the right implementation for the right problem, but that's sometimes easier said than done.


> Now if you are programming a self-driving car or automating the mixture of medicines, maybe be careful where you use LIMIT and OFFSET.

There are plenty of cases where the LIMIT/OFFSET behavior is not great. More boring example: you’re displaying a paginated list of financial transactions and you don’t want duplicates to appear if the list gets updated. Or any kind of audit log. Or an event feed, or infinite scroll.


> What do you call it when the blogosphere pressures you to reject all these time-honored, perfectly good techniques, because of some edge case in a niche field that you will never have to deal with?

At first the term "over-engineering" came to my mind, but that term doesn't quite catch it.

Possibly related discussion (about applying blockchains where it makes no sense): https://news.ycombinator.com/item?id=15401447


Yes, overengineering is close.

The tendency I was thinking of isn't chiefly about chasing the new shiny thing (Magpie programming) or mindlessly including needless libraries (Cargo Cult programming).

The thing I'm thinking of is when someone points out a shortcoming of a particular way of doing things. Like, "If you use a relational database, it might go down if you get 50 million writes at once."

All techniques have trade-offs. So it's no surprise that some established way of doing things has one. The criticism is valid. The thing will fail in that situation. The problem is that the alternative presented by the blogger has more problems than the first. It would be useful in a rare kind of job. But that's not made clear, or the reader can't see past the stain that was shown on the old way of doing things, or the reader can't get past how cool it is that there's a database that can handle 50 million simultaneous writes, or the mere novelty is intoxicating (so there is some overlap with Magpie programming).

This new way doesn't have the first one's shortcoming, but it is literally 10 or 100 times as much work to set up, is missing certain important features that the original solution had, and solves a problem that the reader will never have.


I think this might be a communication problem, where more mature technology needs to be framed differently in order to reach a larger audience with a different background. Maybe this is a direct result of the technical complexities involved, so each engineering audience has a specific culture/lingo that needs to be taken into account when "selling" them a piece of technology or a certain approach.

As other points out, the article doesn't demonize LIMIT/OFFSET at all. In fact, it's offered as one way to paginate. That said, there's a good reason why it's worth demonizing.

First, LIMIT/OFFSET will not give you a transactionally consistent view of the data. If anyone inserts or deletes rows while you're paginating, you'll get dupes or holes. So already you're on thin ice. Good, as the article points out, for stateless pagination in Web 2.0-style web views, but problematic for any application that needs a consistent view (e.g. infinite scroll, or if you're, say, indexing everything into a search engine). Developers might easily miss this.

More importantly, LIMIT/OFFSET doesn't scale to particularly large datasets, and it's one of those things that will bite you at the worst possible time — i.e. when the size of your application plummets over a certain performance threshold that suddenly causes lots of queries to pile up because they're all at OFFSET 10000000. (Watch out for Googlebot paginating everything to infinity!) Since LIMIT/OFFSET requires the result set to be sorted on every query, this paves the way for some truly terrible query plans. If you're lucky, you'll get a fairly efficient index scan, but if you have joins and subqueries, things can get impossibly slow.

I'm currently working on an application where even the "SELECT ... WHERE id > :last_id ORDER BY id LIMIT :page_size" trick is failing me, with queries taking 30-60 seconds because there are many going on concurrently. The entire table is maybe 10 million records, and the WHERE is very selective (i.e. it's looking at a fairly small portion of the full table), but it's still a problem. LIMIT/OFFSET worked back when we had just a few hundred thousand rows in that table.


Whether you're talking about LIMIT/OFFSET or ORDER BY it's still an important thing to be aware of. The crux of the matter is being aware of how much data you're forcing the database to sort to get your end result.

If you're expecting 100 results back from a few million records, just being aware that if you don't trim the results in your WHERE clause it will force processing on that ORDER on a lot of records you're never going to see is a big step.

Totally agree in regards to premature optimization that LIMIT/OFFSET is tremendously more convenient. When it becomes a problem, it's good to know where to look.


If someone else wants to use React then so what ?

Skill set in the team, supportability, experience etc are all just as important as which technology to choose.

And so if someone knows React better and are able to deliver business value quicker then how is that not a good thing. These days it's far easier to find people with React experience than JQuery experience. Likewise for NoSQL or whatever other technology.

Pretty insulting to assume that everyone who uses React, Redis, Cassandra etc are all only driven by blogs and not by any rationale thought.


"premature optimization"

In the article's defense, pretty clear and reasonable about when to use what.


Pasting from http://use-the-index-luke.com/no-offset (2014); any updates on this list since then?

--

The hall of fame of frameworks that do support keyset pagination is rather short:

jOOQ — Java Object Oriented Querying. Docs.

Ruby order_query

Django (Python) chunkator

Django Infinite Scroll Pagination.

SQL Alchemy sqlakeyset.

blaze-persistence — a rich Criteria API for JPA providers

Perl DBIx::Class::Wrapper


Doesn't that strategy only work when you're paginating one page at a time, like Reddit?

How would you implement a deep pagination like "go to page 107" unless you can derive page 107 from your order criteria?


We have this on one of our web apps, and it is a problem. To me the problem is why do we allow jumping to page 107 of 736 in the first place? How does this help someone? The two use cases are 1) user is manually performing a binary search on the results, 2) a bot needs to index the site. Both of these use cases are better solved in other ways, with either search forms that do not suck or index pages generated specifically for crawlers. So the answer to 'how would you implement a deep pagination like "go to page 107"' is to not implement that, and instead improve your design.

>user is manually performing a binary search on the results

Imagine a large forum topic. The user is just seeking through time. A deep pagination here is when the user knows what they're searching for is more than 1 page away.

The user would have to know a snippet of text or an explicit date range for your search box to help, so what about all the times they don't or if their filters are insufficient? This is why forums have both deep pagination and a search box.

It would be cruel to make the user paginate one page at a time just like it'd be unbearable if Youtube didn't let you jump 10 minutes at a time.

Obviously not every pagination needs deep pagination, but that's between you and your users, and it's what came to mind when I read that article.


And that’s why Google’s pagination on search results is broken, and why so much usability was lost. Sometimes I want the lowest ranked results for a query, or the middle-of-the-field results. Let me have them.

>Sometimes I want the lowest ranked results for a query, or the middle-of-the-field results.

What's the use case for that?


Google results are gamed so much, that often personal blogs that don’t do SEO, but offer interesting content, appear beyond page 5 of Google search results, but are still relevant. Similar issues appear in many situations where search is used on third party results, but even on first-party results the first page is often gamed.

I’ve been thinking about it, and ideally one would make a search engine that only indexes pages that have no analytics, tracking, ads, paywalls, etc. Then I’d find those same pages I’m searching for.

To give an example: I found http://blog.deconinck.info/post/2016/12/19/A-Dirt-Cheap-F-Aw... on page 3 of Google for "raspberry pi led", just below https://tech.scargill.net/home-control-2016/ (both of which are interesting IMO)


It's a first result if you search for "raspberry pi led table"...

Yes, but I wasn't looking about tables. Just general things to do with led strips and raspberry pis.

Google is great for finding things you already know, but I'm using it for discovering things I didn't even know I was interested in.

I can't exactly just put the entire dictionary into google to try and find what I might like. I enter a category I might like, and look through the results matching that category.


> How would you implement a deep pagination like "go to page 107"

You wouldn't. Keyset pagination forces the user to do the exact same thing people complain about you doing to the database. Want page 10 of results? Too bad, enjoy running through pages 1-9 and throwing them away. Except now everything is even slower because it's users talking to the service, not the service talking to the db.

There are plenty of use cases where that's a perfectly acceptable trade-off, or an annoying-but-necessary trade-off, but the people all up in a froth over the inefficiency of limit/offset seem to regularly fail to realize that they're advocating for a database performance solution with implications that reach all the way out to the UI/UX level.

At a minimum, I see very little in the way of acknowledgement of this factor.


When you are jumping large numbers of pages, you usually don't care about exact page boundaries.

For example, you could have a background process that slowly paginates through the table and saves off the page boundaries, which you can then use for jumping into. They will always be a bit off especially near the middle of the table, but that's probably okay.

There's lots of ways to do this fuzzy jumping that don't involve scanning the full table synchronously.


First, you should consider whether you actually need that. Also note that it's not just reddit, HN uses the same system.

Second, you can just perform an indirection through offset (+ limit = 1).


This is true. It depends on application but I wonder how often users really need to go to Page 107 of something. Seems like maybe this use case should be solved with some kind of search instead. I have implemented offset pagination myself to several apps where being able to paginate by jumping to the page with records starting with "R" would make more sense.

In most cases this is probably better, but sometimes there is no logical filter that works on you hvae to rely on sort order. One example might be a gallery, I might decide to look though pages 1-10 tonight, 20-30 tomorrow, etc. An arbitrary primary key is the only discriminator that makes sense here. Another example would be classical forum posts, they are sorted by last active but sometimes I'll want to jump to page 134 because that's where I got up to last time and bookmarked my place.

That said, if users were doing this in my software I'd definitely be looking at how I can improve their workflow.


I wrote nexter (https://github.com/charly/nexter) a few years ago without knowing I was applying the keyset principle (it's not pagination though but a next page library). It works well on a consistent data set but often falls short with Nulls, too many associations and whenever I aggregate stuff. Also it generates so many ANDs & ORs you do feel safer using Offset & Limit. I find it frustrating there's no easy solution provided by DBs for such a universal and simple need...

Thanks. I've added it to the Hall of Fame: http://use-the-index-luke.com/no-offset#frameworks

Also Ruby: sequel-seek-pagination

https://github.com/chanks/sequel-seek-pagination

I used it just this week to write a Relay Connection paginator for graphql-ruby and Sequel/Postgres. It uses keyset pagination to reliably avoid skipping items and maintain consistent query times.

Backwards pagination was a bit tricky since that’s not natively supported by sequel-seek-pagination. Basically I had to reverse the list order and then filter with the cursor values to get the last N items in a subquery, then reverse that list again to get back the desired sort order.

https://github.com/rmosolgo/graphql-ruby/pull/1014


Thanks. I've added it to the Hall of Fame: http://use-the-index-luke.com/no-offset#frameworks

I recently implemented one for Node.js (as a Bookshelf.js plugin): https://github.com/binded/bookshelf-cursor-pagination. It supports multiple ordering as well.

Thanks. I've added it to the Hall of Fame: http://use-the-index-luke.com/no-offset#frameworks

> any updates on this list since then?

When I notice, I add/update this list.

I'll check those mentioned in the reply's and add them if the seem to be right.


Nice.

I would recommend adding some sort of a 'last updated' type thing at the bottom of the article, along with a note that you update near that list.


I've added the other frameworks.

As for the update marker, I've just added an "updated" date to next the the publishing date in the breadcrumb.


I was unaware that this is called "keyset pagination". The name doesn't make much sense to me, could someone explain what it is supposed to mean and why it describes this pattern? I called this offset or key-offset pagination; it works well in any database with ordered indices (like B-trees), including ZODB.

A key option is missing from this list: returning everything (with a sanity LIMIT far beyond the usual result count) and just doing the pagination in JavaScript.

I'm sure this will get me all kinds of backlash from noscript purists and optimization fanatics, but it works great, is easy to implement, there's millions of JS table/list libraries that give you the pagination for free (plus free sorting and filtering).

It is by far the most productive option for the developer, if the data size allows (I'd wager it often does), and it does not even have the problems this article describes about OFFSET.


From paragraph 2 of TFA:

> Before continuing it makes sense to mention client-side pagination. Some applications transfer all (or a large part) of the server information to the client and paginate there. For small amounts of data client-side pagination can be a better choice, reducing HTTP calls. It gets impractical when records begin numbering in the thousands.


Ah damn, missed that. Thanks.

I once worked with an insurance company that promised to provide a paginated search for claims. The end result (1) returned 10 at a time for each call (2) with no sort order (3) and no search or filter, just everything. The list averaged 1000 items normally. They also declined to fix this.

I work for big banking company. Here most of the search APIs return 1000 results, no pagination no sorting.


What about using "Row_Number() Over()"? I'd assume it would have similar performance to offset.

One of the most useful resource saving articles I've read in years.



Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: