
Ask HN: How developer friendly are APIs with cursor-based pagination? - QueensGambit
I see companies like Shopify deprecating their existing pagination in favor of cursor-based pagination. As a developer, I have always hated cursor-based pagination because it restricts me to moving serially from one page to another. Is this developer friendly? Why do companies migrate to cursor-based pagination?
======
kamikaz1k
My company is actually going through the transition to cursor/keyset
pagination. As a SaaS company starts having to support larger customers, it
will likely have to go through this exercise. The reason is that most
pagination is implemented with LIMIT/OFFSET in a SQL query. Using
OFFSET becomes very expensive for pagination queries because the query has
to scan every row until it reaches the desired offset.

Imagine a table with 1 million rows, and you return pages in 100-item
chunks. To get page 500 using the OFFSET method, the database has to scan
past 100*499 records before it even reaches the first record you care
about.

This can lead to all sorts of cascading problems, because slow queries can
cause back pressure to build up as other queries queue behind them. So in order
to scale, services push the complexity out to the client. Conceivably, you can
cache the indices on the client side and abstract a page-number-based
paginator on top of them.
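The OFFSET-versus-keyset difference can be sketched in a few lines (a toy sqlite example; the table and column names are made up for illustration):

```python
import sqlite3

# Toy table standing in for a large production table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [(f"item-{i}",) for i in range(1, 1001)])

PAGE_SIZE = 100

# OFFSET pagination: the database still walks past every skipped row.
page_5_offset = conn.execute(
    "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
    (PAGE_SIZE, PAGE_SIZE * 4),
).fetchall()

# Keyset pagination: resume from the last id seen, straight off the index.
last_id = 400  # the cursor the client remembered from the previous page
page_5_keyset = conn.execute(
    "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
    (last_id, PAGE_SIZE),
).fetchall()

assert page_5_offset == page_5_keyset
```

Both queries return the same page here, but the keyset version does a constant amount of work per page, while the OFFSET version's cost grows with the page number.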

Here's an article talking through the issue:
[https://www.moesif.com/blog/technical/api-design/REST-API-De...](https://www.moesif.com/blog/technical/api-design/REST-API-Design-Filtering-Sorting-and-Pagination/#)

~~~
QueensGambit
Sorry! I don't think this is developer friendly. You are just pushing your
problem onto the developers. Do you really think developers will be able to
make it faster than your implementation could? Take your example of caching
indices. Do I cache them for each search filter, and where do I store them?
And again, do I have to serially execute all pages to cache the cursors needed
to build pagination on top of them?

------
slap_shot
I run a company that integrates data from hundreds of sources, including
Shopify. There are only about five simple mistakes that most data extraction
APIs make, and this is one of them.

Very few APIs that implement pagination work optimally. If I query for all
orders whose updated_at is greater than 2019-10-23 07:00:00 and paginate
through the results, there's a good chance that any record updated after my
query started but before my pagination completes will be missed by the
paginated queries. If I "checkpoint" the greatest updated_at retrieved in the
most recent query, I will likely miss those records. That leaves me to use the
time I started retrieving data as my new checkpoint.

With a cursor-based pagination system, there is at least a chance that the
service I'm calling is dynamically adjusting its underlying query to account
for this scenario.
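That checkpointing workaround can be sketched roughly like this (a minimal Python illustration; `fetch_updated_since` is a hypothetical stand-in for a real API client that paginates internally):

```python
from datetime import datetime, timezone

def incremental_sync(fetch_updated_since, checkpoint):
    """Pull all records updated since `checkpoint`.

    Records written while pagination is in flight can be missed, so the
    safe new checkpoint is the moment this sync *started*, not the
    greatest updated_at observed in the results.
    """
    started_at = datetime.now(timezone.utc)
    records = list(fetch_updated_since(checkpoint))  # paginates internally
    return records, started_at  # resume from started_at next time
```

The trade-off is that the next sync re-fetches anything updated while the previous sync was running, but re-fetching is cheap compared to silently losing records.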

Out of curiosity, what are the scenarios when you want to jump between pages
(e.g. not iterate over the pages in a serial fashion)?

~~~
QueensGambit
Can you please list the other 4 mistakes APIs make?

~~~
slap_shot
Of course! I'm actively trying to standardize data extraction APIs, so I'm
always happy to spread the good word. :)

1\. The API should expose parameters for incrementally retrieving data by, at
the very least, the date the record was created, the date the record was
updated, and a monotonically increasing value. The API should also expose
parameters for sorting data by, again, at least the three fields mentioned
above. These parameters should be consistent across _every_ endpoint.

2\. Every record in the system should have a unique identifier. For instance,
in Shopify, each record in order.tax_lines should have a unique identifier.
Since they do not, you have to create a composite key of the order_id and its
index, or take a hash from some set of values in the record.
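The composite-key and hash workarounds might look like this (illustrative Python; the field names are assumptions, not Shopify's actual schema):

```python
import hashlib
import json

def synthetic_id(order_id, index, record):
    """Derive a stable identifier for a child record that has none.

    The composite key survives edits to the record's values; the content
    hash survives reordering of the children. Pick whichever failure
    mode is less painful for your pipeline.
    """
    composite_key = f"{order_id}:{index}"
    content_hash = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return composite_key, content_hash

key, digest = synthetic_id(1001, 0, {"title": "State Tax", "rate": 0.08})
# key == "1001:0"
```

Note that `sort_keys=True` makes the hash deterministic regardless of the order the fields arrive in.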

3\. Every entity in the system should be retrievable by the parameters exposed
in rule #1. To get nested entities, many APIs require you to fetch the
"parent" entity and then make a separate API call for each record to get the
child entities. For instance, /customers might return 200 records, and then
for each customer you have to call /customer/{{customer_id}}/payment_sources.
You now have to make 201 requests to get all customers' payment_sources, when
there should just be a /customer_payment_sources endpoint.

4\. The API should implement a leaky bucket algorithm with known and
documented rules, allowing the caller to configure an exponential backoff
that is optimized for that API. The API should also define a maximum window of
time (e.g. 5 minutes) with no requests, after which the bucket completely
resets.
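On the caller's side, the backoff this enables might look something like the sketch below (`RateLimited` stands in for whatever signal the API gives, typically an HTTP 429):

```python
import random
import time

class RateLimited(Exception):
    """Raised when the server's leaky bucket rejects a request."""

def call_with_backoff(request, max_attempts=6, base_delay=0.5):
    """Retry a rate-limited callable with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

With documented bucket rules, `base_delay` and `max_attempts` can be tuned to the API's actual drain rate instead of guessed at.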

Smaller issues, but still annoying:

* represent _all_ timestamps in a common format (I think yyyy-MM-dd HH:mm:ss is best; I do not like unix timestamps since you cannot easily tell programmatically whether a value is a timestamp or just a large integer).

* return results in JSON (still lots of XML being tossed around)

* keep a change log that can be programmatically checked. We spend hours a week checking APIs to see if things have changed.

One last thought: Webhooks are becoming standard for many APIs, which is
great. However, webhooks only expose data going forward. Many businesses build
up years of valuable data in the origin system that cannot be retrieved
without a proper extraction API. Those should be built first, then webhooks
(imo).

We've built some pretty awesome code to allow us to retrieve data from
virtually anywhere, but if companies would just follow the rules above, it
would be 10-100x easier for everybody involved.

~~~
QueensGambit
This is great! You should create an API usability guide!

On webhooks, how does the caller (like Shopify) know whether the transaction
completed without errors? Is there a message queue built specifically for
webhooks?

~~~
slap_shot
The bullet points I listed were quick summaries of a larger internal wiki that
we are building and working directly with data providers to implement. We'll
definitely slim it down to a nice guide for developers to reference when
beginning the journey of building one of these APIs.

> On webhooks, how does the caller (like Shopify) know whether the transaction
> completed without errors?

Most services have rules for the number of times they will attempt to execute
any single request before stopping. Further, they typically have rules for how
many times they will execute _any_ request before finally unregistering the
webhook. Some will queue the events up and try for days, others will
unregister the webhook after 5 failed attempts in a window.

------
fiedzia
Because the abstraction of pages is too expensive to maintain for larger
datasets, and very confusing for data that is dynamically generated.

~~~
QueensGambit
I can understand that it is expensive to maintain. But how do cursors solve
the problem of dynamically generated data? In fact, executing serially
increases the chances of missing a lot more dynamic data than getting all
pages at once would.

~~~
fiedzia
You send a query. The service determines there is more than one page of data
and that the last id from the first page is X. The next request checks for
data matching the query and id>X. All the service needs to do is find the
first page of results and check whether there is a chance of a next page. This
means very little work needs to be done, and if some entries are added or
removed between requests, it's not a problem.

If you go to the second page (id>X) and back (id<X), you may get different
results; that's how changes are handled. Assuming you have an index on id (or
whatever parameters are used), this guarantees predictable response times.

A cursor is usually just an association of a few parameters (query parameters,
sort order, and last id).
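One common way to build such an opaque cursor is to base64-encode those parameters (a sketch in Python; real services usually also sign the token so clients can't tamper with it):

```python
import base64
import json

def encode_cursor(query, sort_order, last_id):
    """Pack the query, sort order, and last seen id into an opaque token."""
    payload = {"query": query, "sort": sort_order, "last_id": last_id}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_cursor(token):
    """Recover the parameters needed to resume where the client left off."""
    return json.loads(base64.urlsafe_b64decode(token.encode()))

cursor = encode_cursor({"status": "open"}, "id asc", 42)
```

The client treats the token as a black box; the server decodes it and runs the same keyset query with `id > last_id`.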

With page-number pagination you need to:

* figure out how many results you have, which may require scanning all of them (very expensive)

* handle the case where the client requested page 1 and then jumped to page 10000, which means you have to scan a lot of data to figure out what should be on that page (if it exists)

* figure out what to do when the client requests page 1, page 2, then page 1 again, while the data changed. Do you cache results (expensive, and leads to showing stale data)? Or show different data (confusing for the user: where did the thing I was just looking at go? why do things jump between pages? why does the last record from page 1 reappear on page 2?)

------
nyuszika7h
I think cursors make sense because they ensure the data doesn't change
unexpectedly under you while you're paginating.

