Ask HN: How developer friendly are APIs with cursor-based pagination?
6 points by QueensGambit 4 months ago | 13 comments
I see companies like Shopify deprecate their existing pagination in favor of cursor-based pagination. As a developer, I always hated cursor-based pagination because it restricts me to moving serially from one page to another. Is this developer friendly? Why do companies migrate to cursor-based pagination?

My company is actually going through the transition to cursor/keyset pagination. As a SaaS company starts having to support larger customers, it will likely have to go through this exercise. The reason is that most pagination is implemented with LIMIT/OFFSET in a SQL query. Using OFFSET becomes very expensive for pagination queries because the query has to scan through every row until it reaches the desired offset.

Imagine a table with 1 million rows, and you return pages in 100-item chunks. To get page 500 using the OFFSET method, the database has to scan past 100*499 = 49,900 records before it even reaches the first record you care about.
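The difference between the two approaches can be sketched with a toy dataset (a sketch using SQLite's in-memory database; the table and column names are made up):

```python
import sqlite3

# Toy dataset: 10,000 rows with an indexed primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 10_001)])

PAGE = 100

def page_by_offset(page_num):
    # OFFSET forces the engine to walk past page_num * PAGE rows
    # before it can return the first row of the requested page.
    return conn.execute(
        "SELECT id, total FROM orders ORDER BY id LIMIT ? OFFSET ?",
        (PAGE, page_num * PAGE)).fetchall()

def page_by_keyset(last_id):
    # Keyset pagination seeks straight to the start via the index.
    return conn.execute(
        "SELECT id, total FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, PAGE)).fetchall()

offset_rows = page_by_offset(50)         # rows 5001..5100, after a long scan
keyset_rows = page_by_keyset(50 * PAGE)  # same rows, via an index seek
```

Both calls return the same page; the difference is how much work the engine does to find its first row.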

This can lead to all sorts of cascading problems, because slow queries create back pressure as other queries queue up behind them. So in order to scale, services push the complexity out to the client. Conceivably, you can cache the indices on the client side and abstract a page-number based paginator on top of them.
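A minimal sketch of that client-side abstraction, assuming a hypothetical `fetch_page(cursor)` callable that returns a page of items plus the cursor for the next page:

```python
class PageCursorCache:
    """Emulates page-number access on top of a cursor API by caching
    the cursor that starts each page as pages are fetched."""

    def __init__(self, fetch_page):
        # fetch_page(cursor) -> (items, next_cursor); cursor=None means page 0.
        self.fetch_page = fetch_page
        self.cursors = {0: None}

    def get_page(self, n):
        # Walk forward serially from the nearest known cursor until
        # we reach page n, caching every cursor discovered on the way.
        known = max(p for p in self.cursors if p <= n)
        items = None
        for p in range(known, n + 1):
            items, next_cursor = self.fetch_page(self.cursors[p])
            self.cursors[p + 1] = next_cursor
        return items

# Usage with a fake data source standing in for the remote API:
data = list(range(1000))

def fake_fetch(cursor):
    start = cursor or 0
    return data[start:start + 100], start + 100

cache = PageCursorCache(fake_fetch)
page3 = cache.get_page(3)  # fetches pages 0..3, caching cursors as it goes
```

Subsequent jumps to already-visited pages only refetch one page, but a first jump deep into the result set still has to walk there serially, which is exactly the complaint below.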

Here's an article talking through the issue: https://www.moesif.com/blog/technical/api-design/REST-API-De...

Sorry! I don't think this is developer friendly. You are just pushing your problem to the developers. Do you really think developers will be able to make it faster than your implementation? Take your example of caching indices. Do I cache for each search filter and where do I store them? And again, do I serially execute all pages to cache cursors to build pagination on top of them?

The performance issue is easily solved by using an index. Getting page 500 would be just as fast: SELECT * FROM table WHERE order_num >= 500*100 AND order_num < 501*100

I run a company that integrates data from hundreds of sources, including Shopify. There are only about 5 simple mistakes that most data extraction APIs mess up, and this is one of them.

Very few APIs that implement pagination work optimally. If I query for all orders whose updated_at is greater than 2019-10-23 07:00:00 and paginate through the results, there's a good chance that any record updated while my pagination is in progress will be missed by the paginated queries. If I "checkpoint" the greatest updated_at retrieved in the most recent query, I will likely miss records updated after my query started but before it completed, leaving me to use the time I started retrieving data as my new checkpoint instead.
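That checkpointing strategy can be sketched like this (`fetch_updated_since` is a hypothetical function that paginates through everything updated after the checkpoint):

```python
from datetime import datetime, timezone

def incremental_sync(fetch_updated_since, checkpoint):
    # Record the run's start time *before* fetching: anything updated
    # while pagination is in flight falls after this timestamp, so it
    # will be re-fetched on the next run instead of silently missed.
    run_started = datetime.now(timezone.utc)
    records = fetch_updated_since(checkpoint)  # paginates internally
    return records, run_started  # run_started becomes the next checkpoint
```

The trade-off is some duplicate fetching of records updated mid-run, which is why the sync must be idempotent (upsert, not insert).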

With a cursor based pagination system, there is at least a chance the service that I'm calling to is dynamically adjusting their underlying query to account for this scenario.

Out of curiosity, what are the scenarios when you want to jump between pages (e.g. not iterate over the pages in a serial fashion)?

Usually, it is for syncing products or orders. If the API allows me to query by page number, I can execute all the pages in parallel and complete the job in seconds. If it is paginated by cursor, I have to do it one by one serially which might take minutes.

Since I run this on serverless compute, which has a timeout (or a browser timeout if a user is waiting on it), splitting these sync jobs is needlessly complicated.
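The parallel approach a page-numbered API allows could look like this (`fetch_page` is a stub standing in for a real page-numbered endpoint; with a cursor API these calls would have to be chained serially instead):

```python
from concurrent.futures import ThreadPoolExecutor

# Stub standing in for a hypothetical page-numbered HTTP endpoint.
def fetch_page(page_num):
    return [{"id": page_num * 100 + i} for i in range(100)]

# With page numbers, all pages can be requested concurrently.
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch_page, range(10)))

# pool.map preserves page order, so flattening keeps records sorted.
orders = [record for page in pages for record in page]
```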

Can you please list the other 4 mistakes APIs make?

Of course! I'm actively trying to standardize data extraction APIs, so I'm always happy to spread the good word. :)

1. The API should expose parameters for incrementally retrieving data by, at the very least, the date the record was created, the date the record was updated, and a monotonically increasing value. The API should also expose parameters for sorting data by, again, at least the three fields mentioned earlier. These parameters should be consistent across _every_ endpoint.

2. Every record in the system should have a unique identifier. For instance, in Shopify, each record in order.tax_lines should have a unique identifier. Since they do not, you have to create a composite key of the order_id and its index, or take a hash from some set of values in the record.
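A sketch of that composite-key workaround (the field names are illustrative, not Shopify's actual schema):

```python
import hashlib
import json

def synthetic_id(order_id, index, record):
    # Composite key: parent id plus position within the parent...
    composite_key = f"{order_id}:{index}"
    # ...or a content hash, if the position isn't stable across fetches.
    # sort_keys makes the hash independent of the record's key order.
    content_hash = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()[:16]
    return composite_key, content_hash
```

Neither substitute survives the record being reordered or edited, which is why a real unique identifier from the source system is so much better.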

3. Every entity in the system should be retrievable by the parameters exposed in rule #1. To get nested entities, many APIs require you to fetch the "parent" entity and then make a separate API call for each record to get the child entities. For instance, /customers might return 200 records, and then for each customer you have to call /customer/{{customer_id}}/payment_sources. So you now have to make 201 requests to get all customers' payment_sources, when there should just be a /customer_payment_sources endpoint.

4. The API should implement a leaky bucket algorithm with known and documented rules, allowing the caller to configure an exponential backoff that is optimized for that API. After a maximum window of inactivity (e.g. 5 minutes with no requests), the API should completely reset the bucket.
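A caller-side exponential backoff against such a leaky bucket might be sketched like this (`RateLimitError` is a hypothetical exception standing in for an HTTP 429 response):

```python
import random
import time

class RateLimitError(Exception):
    """Raised by `request` when the API reports rate limiting (HTTP 429)."""

def call_with_backoff(request, max_retries=5, base_delay=0.5):
    # Exponential backoff with jitter: the delay doubles on each retry,
    # giving the server's leaky bucket time to drain between attempts.
    for attempt in range(max_retries):
        try:
            return request()
        except RateLimitError:
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return request()  # final attempt; let any error propagate to the caller
```

With documented bucket rules, `base_delay` and `max_retries` can be tuned to the API's actual drain rate instead of guessed.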

Smaller issues, but still annoying:

* represent _all_ timestamps in a common format (I think yyyy-MM-dd HH:mm:ss is best. I do not like unix timestamps since you cannot easily programmatically tell if it is a timestamp or just a large integer).

* return results in JSON (still lots of XML being tossed around)

* keep a change log that can be programmatically checked. We spend hours a week checking APIs to see if things have changed.

One last thought: Webhooks are becoming standard for many APIs, which is great. However, webhooks only expose data going forward. Many businesses build up years of valuable data in the origin system that cannot be retrieved without a proper extraction API. Those should be built first, then webhooks (imo).

We've built some pretty awesome code to allow us to retrieve data from virtually anywhere, but if companies would just follow the rules above, it would be 10-100x easier for everybody involved.

This is great! You should create an API usability guide!

On webhooks, how does the caller (like Shopify) know if the transaction completed without errors? Is there any message queue built specifically for webhooks?

The bullet points I listed were quick summaries of a larger internal wiki that we are building and working directly with data providers to implement. We'll definitely slim it down to a nice guide for developers to reference when beginning the journey of building one of these APIs.

> On webhooks, how does the caller (like Shopify) know if the transaction completed without errors?

Most services have rules for the number of times they will attempt to execute any single request before stopping. Further, they typically have rules for how many times they will execute _any_ request before finally unregistering the webhook. Some will queue the events up and try for days, others will unregister the webhook after 5 failed attempts in a window.

Because the abstraction of pages is too expensive to maintain for larger datasets, and very confusing for data that is dynamically generated.

I can understand that it is expensive to maintain. But how do cursors solve the problem of dynamically generated data? In fact, executing serially increases the chances of missing a lot more dynamic data than getting all the pages at once would.

You send a query. The service determines that there is more than one page of data and that the last id on the first page is X. The next request checks for data matching the query and id > X. All the service needs to do is find the first page of results and check whether there is a chance of a next page. This means very little work needs to be done, and if some entries are added or removed between requests, it's not a problem.

If you go to the second page (id > X) and back (id < X), you may get different results; that's how changes are handled. Assuming you have an index on id (or whatever parameters are used), this guarantees predictable response times.

A cursor is usually just an association of a few parameters (query parameters, sort order, and last id).
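For example, a server might build such a cursor by serializing those parameters and base64-encoding them so the token stays opaque to clients (a sketch, not any particular API's format):

```python
import base64
import json

def encode_cursor(last_id, sort="id", filters=None):
    # Pack the query state into an opaque, URL-safe token.
    payload = {"last_id": last_id, "sort": sort, "filters": filters or {}}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_cursor(token):
    # The server decodes the token to resume the query with id > last_id.
    return json.loads(base64.urlsafe_b64decode(token.encode()))
```

Because the token carries the full query state, the server can rebuild the exact continuation without any session storage.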

With page-number pagination you need to:

- figure out how many results you have, which may require scanning all of them (very expensive)

- handle the case where the client requested page 1 and then jumped to page 10000, which means you have to scan a lot of data to figure out what should be on that page (if it exists)

- figure out what to do when the client requests page 1, page 2, then page 1 again while the data changed. Do you cache results? (expensive, and leads to showing stale data) Or show different data? (confusing for the user: where did the thing I was just looking at go? why do things jump between pages? why does the last record from page 1 reappear on page 2?)

I think cursors make sense because they ensure the data doesn't change unexpectedly under you while you're paginating.

