
REST and long-running jobs (2014) - geezerjay
https://farazdagi.com/2014/rest-and-long-running-jobs/
======
epaulson
One thing this piece leaves out is that if the operation/creation is in fact
expensive, you want to make sure you have some kind of two-phase option so you
don't inadvertently create multiple copies.

If the client crashes after the POST to create 'death star' succeeds (so
construction starts), but before it can read the response of the POST, it
never gets the pointer to /queue/12345. If the client then reissues the POST,
the server should reject the second attempt to create a star named 'death
star' and give back the same pointer to /queue/12345. Or, if you want to be
able to have two stars named 'death star', the client should first POST to
some endpoint to get a unique identifier from the server, and then use that
identifier to kick off the creation of the death star, so the client can
reason about the current state.
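
The dedupe-by-name idea above can be sketched roughly like this (the handler and its names are invented for illustration): the server keeps a map from an idempotency key to the queue resource it already created, so a retried POST gets the same pointer instead of starting a second build.

```python
# Hypothetical sketch: dedupe POSTs by star name so a crashed client can
# safely retry and recover the pointer it never received.

queue_by_key = {}    # idempotency key (star name) -> queue resource path
next_task_id = [12345]

def post_star(name):
    """Handle POST /stars. Returns (status, location)."""
    if name in queue_by_key:
        # Duplicate POST after a client crash: hand back the same task.
        return 202, queue_by_key[name]
    task_id = next_task_id[0]
    next_task_id[0] += 1
    location = f"/queue/{task_id}"
    queue_by_key[name] = location
    # ... enqueue the actual construction job here ...
    return 202, location

first = post_star("death star")
retry = post_star("death star")
assert first == retry   # the retry gets the original pointer, no second star
```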

~~~
Anderkent
Or you could just generate a unique identifier on the client and save the
roundtrip? UUIDs aren't really expensive.

~~~
farazbabar
A UUID is not enough to prevent duplicate resource creation. Consider the
following steps: 1. receive a request with a UUID, 2. persist the fact that
the system received this request, 3. who or what processes it? Even if you
are working off atomic persistence, the resource creation process (the
kick-off process) needs to maintain state: it must record that it plucked a
task off the queue or table, atomically update that state in persistence as
it creates the resource somewhere else, and log the successful creation of
the resource as yet another entry in the persistence store. That way,
requests that exceed their SLA can be retried, and the downstream systems
must act with correct idempotent behavior for all of this to work correctly
in the presence of failures.

It is all very technically possible, but it goes beyond just a UUID (in other
words, UUID is necessary but not sufficient).
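
The state machine described above can be sketched with an atomic claim step (table and column names are made up; SQLite stands in for whatever persistence store is used): a request is persisted under its UUID, a worker atomically flips its state to claim it, and completion is recorded as a final state.

```python
import sqlite3
import uuid

# Sketch only: UUID + persisted state transitions, so retries and
# concurrent workers can't double-create the resource.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, state TEXT)")

def receive(task_id):
    # Steps 1+2: persist the request; the PRIMARY KEY makes a retried
    # request with the same UUID a no-op rather than a second task.
    db.execute("INSERT OR IGNORE INTO tasks VALUES (?, 'queued')", (task_id,))

def claim(task_id):
    # Step 3: a worker atomically moves queued -> running; rowcount tells
    # us whether *this* worker won, so two workers can't both process it.
    cur = db.execute(
        "UPDATE tasks SET state='running' WHERE id=? AND state='queued'",
        (task_id,))
    return cur.rowcount == 1

def complete(task_id):
    # Final log entry: the resource was successfully created.
    db.execute("UPDATE tasks SET state='done' WHERE id=? AND state='running'",
               (task_id,))

tid = str(uuid.uuid4())
receive(tid)
receive(tid)                  # retried request: ignored, not duplicated
assert claim(tid) is True     # first worker wins the claim
assert claim(tid) is False    # second worker sees it's already taken
complete(tid)
```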

~~~
Anderkent
The issues you list don't really seem relevant to duplicate resource
creation? They're about handling creation of the resource in the first place.

As long as the UUID is persisted, you can redirect all attempts to recreate
that ID to whatever job/queue actually backs the resource.

------
whack
> Your first instinct might be: “What if I return HTTP 201 Created
> immediately, but defer the actual creation to some later point?”... Well,
> you can’t do that. If you do, you will be violating the HTTP/1.1: Semantics
> and Content protocol

> We know what status code to return, what about Location header? Simple.
> Instead of the location to the actual resource (/stars/97865), API will
> return location of the queued task that got created (/queue/12345)

> How to learn when resource is finally available? You need to query the
> queued task... Once resource is created, API should respond with 303 See
> Other status code on all the subsequent requests to the queued task... There
> are two alternative ways to deal with the temporary task resource: API
> client must issue DELETE request, so that server purges it. or, garbage
> collection can be a server’s job to do

I've always wondered about how important it is to follow REST conventions, at
the cost of simplicity, when you control both the server and client software.

In the example given above, the recommendation to use a 202 instead of 201,
sounds perfectly reasonable. However, creating and returning to the client a
new queue resource, which returns a 303 when complete, and the client then has
to delete... Is it really so bad to just return /stars/98765 immediately? And
when queried, have the relevant GET API return a response indicating that
construction of the resource is still in progress? If the client needs to wait
for construction to complete before doing something else, can't it just poll
/stars/98765 instead of /queue/12345?
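
The simpler alternative suggested above might look like this (the payload shape and handler are invented): GET /stars/&lt;id&gt; itself reports construction progress, so the client polls one URL for the resource's whole lifecycle instead of a separate queue resource.

```python
# Sketch: one URL for the star, whether it's still building or finished.

stars = {98765: {"status": "in_progress", "percent": 40}}

def get_star(star_id):
    """Handle GET /stars/<id>. Returns (http_status, body)."""
    star = stars.get(star_id)
    if star is None:
        return 404, None
    if star["status"] == "in_progress":
        # The resource exists but isn't built yet: 200 with a status body.
        return 200, {"status": "in_progress", "percent": star["percent"]}
    return 200, {"status": "ready", "star": star}

status, body = get_star(98765)
assert status == 200 and body["status"] == "in_progress"
```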

More generally speaking, if you're designing an API for use by your own
client, and you're facing a tradeoff between interface/implementation
simplicity vs following the official REST protocol, is it really worth
choosing the latter?

~~~
vfaronov
HTTP is an inherently complex protocol, which has over time accrued many
idiosyncratic, non-orthogonal features to support various use cases of the
growing Web. Just consider that there exists an entire class, entire design
space of libraries known as “HTTP routers” which boil down to _extracting
arguments from the first line of an HTTP request_.

If you want simplicity, and you fully control both sides, and you don’t care
about the systemic advantages that REST purports to provide, then your best
bet is to avoid HTTP altogether (which in practical terms may of course mean
tunneling through it) and stick to a simple, modern RPC protocol.

~~~
geezerjay
> Just consider that there exists an entire class, entire design space of
> libraries known as “HTTP routers” which boil down to extracting arguments
> from the first line of an HTTP request.

This comment sounds a bit disingenuous. Routing involves far more than
extracting arguments from the first line of an HTTP request, but its
complexity is not due to HTTP. Routing is based on content negotiation, and
essentially everyone is free to design their own personal content negotiation
process, and very often they do.

Take, for example, content type. Do file extensions matter? Does the
Content-Type header mean anything? If both are used, which should prevail?
What should the router do if neither was passed?
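
Those precedence questions can be made concrete with a toy negotiation rule (the rules and function are invented, and each router is free to pick different ones, which is exactly the point): here the file extension wins, the header is the fallback, and there is an explicit default when neither is given.

```python
# Toy sketch: one possible (arbitrary) content negotiation policy.

def negotiate(path, headers):
    ext_map = {".json": "application/json", ".xml": "application/xml"}
    for ext, ctype in ext_map.items():
        if path.endswith(ext):
            return ctype                    # extension prevails in this design
    if "Content-Type" in headers:
        return headers["Content-Type"]      # header is the fallback
    return "application/json"               # default when neither was passed

assert negotiate("/stars.xml", {}) == "application/xml"
assert negotiate("/stars", {"Content-Type": "text/plain"}) == "text/plain"
assert negotiate("/stars", {}) == "application/json"
```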

On top of that, let's add HATEOAS, HAL, content-type API versioning, custom
headers, etc...

In the end, developers need to map HTTP requests to an action, and HTTP is
not the problem.

Libraries are helpful not because the problem is complex, but because
ready-made solutions are better than rolling our own. There are plenty of
libraries not because HTTP is complex, but because plenty of people have
their own preferences.

------
bsenftner
This author misses a key algorithm design option: the API does NOT offer a
CREATE on such a resource, the API offers a CREATE on a JOB which generates
the long-time-to-create resource. So the API client simply creates the Job,
receives an ID immediately, which can be queried for status, forever. Long
after the job completes, it serves as a timestamped record of that specific
resource's creation. Only after the job's status reaches predefined levels do
the assets (files/data) associated with it become available.

I've used this pattern with success for over a decade, first with on-demand
digital products, which required compiling/generating time after a customer
request. The pattern also supports partial generation of long-time-to-generate
resources. I used it with a custom game avatar service, where various
modification and customization options are visualized without fully creating
them. The visualization of the modified product is a 'milestone' asset
generated by the resource creation pipeline. If the end-user chooses not to
purchase said modification, that specific asset's sub-job is killed or never
started.

With complex, multi-layered digital products this pattern works quite well.
Even after a period of time, the end-user could return to a multi-leveled job
and request an asset they previously denied. That sub-job gets timestamped
just the same as if it ran originally. It just works.
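
The job-as-resource pattern described above can be sketched roughly like this (names and shapes are illustrative, not the poster's actual system): the client creates a job, gets its ID back immediately, and the job record persists as a queryable, timestamped history even after it completes.

```python
import itertools

# Sketch: CREATE a job, not the resource; the job outlives the work.

jobs = {}
ids = itertools.count(1)

def create_job(kind):
    job_id = next(ids)
    jobs[job_id] = {"kind": kind, "status": "queued", "assets": []}
    return job_id             # returned immediately, before any work runs

def job_status(job_id):
    return jobs[job_id]["status"]

def finish(job_id, asset):
    # Once a milestone status is reached, its asset becomes available.
    jobs[job_id]["status"] = "complete"
    jobs[job_id]["assets"].append(asset)

jid = create_job("avatar-render")
assert job_status(jid) == "queued"
finish(jid, "preview.png")
assert job_status(jid) == "complete"   # still queryable after completion
```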

~~~
geezerjay
> This author misses a key algorithm design option: the API does NOT offer a
> CREATE on such a resource, the API offers a CREATE on a JOB which generates
> the long-time-to-create resource. So the API client simply creates the Job,
> receives an ID immediately, which can be queried for status, forever.

You should read the article because that's the approach it describes.

------
coredog64
A coworker and I were discussing this very thing on Friday. He and I exhausted
our Google-fu without success: Does anyone know of a framework/library that
implements this? Don’t really care about the language, but per other comments
we’re interested in the practical lessons learned that would show up in code.

------
vfaronov
RFC 7240 also defines an optional way for the client to signal whether it
wishes the request to be processed asynchronously:
[https://tools.ietf.org/html/rfc7240#section-4.1](https://tools.ietf.org/html/rfc7240#section-4.1)

~~~
geezerjay
My first impression is that RFC 7240 isn't appropriate for this use case,
because it proposes a way for the client to signal optional/preferred
behavior when said behavior is already made mandatory by the server.

To put it differently, why would it matter whether a client POSTs with
Prefer: respond-async if the server's response will always be async? Whether
that header is present or not, the HTTP response is already designed to
always return a Location header.
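
For concreteness, the negotiation RFC 7240 describes can be sketched like this (the handler shape is invented): when the client sends `Prefer: respond-async` and the server honors it, the server echoes `Preference-Applied`; a server that is async regardless returns the same 202 either way, which is the point made above.

```python
# Sketch: a server honoring (or not needing) "Prefer: respond-async".

def handle_post(headers):
    """Returns a dict standing in for the HTTP response head."""
    resp = {"status": 202, "Location": "/queue/12345"}
    if "respond-async" in headers.get("Prefer", ""):
        # Per RFC 7240, the server may acknowledge the applied preference.
        resp["Preference-Applied"] = "respond-async"
    # Either way the response is async: 202 plus a Location to poll.
    return resp

resp = handle_post({"Prefer": "respond-async"})
assert resp["status"] == 202
assert resp["Preference-Applied"] == "respond-async"
```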

------
Multicomp
Someone once gave a talk on 'The Art of API Design' that covered the logical
design of a RESTful API, and one of their points was in close alignment with
the article.

Can't find the slides to share, but I'll edit/reply to myself if I can. The
talk was very informative and covered not just resource creation but also
destruction, discovery, and even a bit about authentication.

------
wilgertvelinga
Anybody here who used this pattern in practice and found out it is not as good
as this article portrays it?

~~~
twodave
Bottom line is if you’re doing this via REST and not using web sockets (which
is fine, of course!) then you’re going to end up in a polling situation.
That’s the important thing to think about. I had a bulk export job that ran
similar to this, minus a lot of the frills. On the initial POST you get back a
token that would allow you to poll the job status. Until you went to download
the (often several GB) payload, all requests were extremely fast. At most we
would insert a single record into a DB and queue a tiny object onto the
message bus. In the status poll it was a simple key lookup.
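
The polling loop that flow implies can be sketched like this (the helper and its parameters are invented): POST returns a token, then the client repeatedly hits the cheap status endpoint until the payload is ready.

```python
import time

# Sketch: client-side polling against a cheap status check.

def poll_until_done(check_status, interval=0.01, max_tries=50):
    """check_status() -> 'pending' | 'done'. Returns True once done."""
    for _ in range(max_tries):
        if check_status() == "done":
            return True
        time.sleep(interval)   # each check is a simple key lookup server-side
    return False

# Simulated status endpoint: two pending polls, then done.
states = iter(["pending", "pending", "done"])
assert poll_until_done(lambda: next(states)) is True
```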

In my experience nobody seriously tries to figure out how to use your API by
calling it, they’ll prefer to look at your code examples and documentation.

Nobody is going to build a generic-enough REST client to make all the HATEOAS
crap matter for their workflow. In my view it’s a waste of time, but maybe
others have found it measurably useful?

~~~
geezerjay
> Bottom line is if you’re doing this via REST and not using web sockets

WebSockets are not a viable solution for this problem because long-running
processes are long-running, and it would make no sense to keep a connection
open for minutes or hours just to poll the process status.

~~~
twodave
Sure, good point. I’d argue this isn’t exactly black and white but you won’t
find me defending web sockets for many applications at all anyway.

------
wcoenen
Instead you could also POST to /star-construction to get back a link to a
/star-construction/foo resource, representing a long-running creation process.

This resource could be polled to get progress updates, and eventually a link
to the created star resource, or perhaps an error describing why it didn't
work.

The advantage is that you don't have to deal with two different types of
resources being returned at the same URL, which complicates parsing the
response. If creations are queued, you also get a view of the workload still
ahead of you.
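
A sketch of that construction resource (paths and field names are illustrative): /star-construction/foo always returns the same shape, carrying progress, an eventual link to the created star, or an error, so the client never has to parse two different resource types at one URL.

```python
# Sketch: a dedicated construction resource with a stable response shape.

constructions = {
    "foo": {"progress": 100, "result": "/stars/97865", "error": None},
    "bar": {"progress": 40, "result": None, "error": None},
}

def get_construction(name):
    """Handle GET /star-construction/<name>."""
    c = constructions[name]
    # Always one resource type here: a construction record.
    return {"progress": c["progress"],
            "result": c["result"],     # link to the star, once created
            "error": c["error"]}       # or why creation failed

done = get_construction("foo")
assert done["result"] == "/stars/97865"
assert get_construction("bar")["result"] is None   # still in progress
```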

------
anuraj
Better options would be:

1. Send a partial HTTP response (206) with progress details, which may be
handled using the AJAX progress API at the client before the final response
is sent. This is processing- and network-intensive, but may be needed in
certain scenarios like drawing graphics. You cannot recover from connection
loss.

2. Hand the task over to a job scheduler and return the job ID. The client
can then poll the job ID using a status API, or get the status pushed via a
webhook. This is the better approach for the majority of cases.

~~~
zbjornson
206 is only valid to respond to range requests, which can only be GET
requests.

~~~
geezerjay
Additionally, the 206 suggestion fails to take into account basic use cases
such as having more than one queued job.

------
shoo
I found this discussion a bit incomplete, as it didn't really discuss the
consequences, costs, or benefits of modelling things one way or another.

Suppose we didn't use REST or HTTP for our API. What would be a good way to
model this kind of thing?

~~~
jmts
This is essentially just an asynchronous response to some request. There are
probably three cases:

1. The response is unimportant: do nothing.

2. Response timing is not important: the client polls.

3. Response timing is important: the server calls back into the client
somehow.

The article describes an implementation of case 2. Depending on the
technologies available, I'd argue case 3 is probably the most versatile:
cases 1 and 2, as well as the blocking case, can all be implemented on top of
case 3, if the available technology allows it.

------
sltkr
This article is missing one detail: what if resource creation fails? How do
you signal that failure when the client calls GET on the task queue resource,
in a way that's distinguishable from a failure to retrieve the task queue
status?

~~~
icebraining
That's like asking whether you should send a 404 when you serve an article
about a city that no longer exists.

HTTP status codes refer to the resource being requested (in this case, the
creation task itself), not to something that resource refers to.
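
That distinction can be sketched like this (the payload shape is invented): the HTTP status describes fetching the *task*, while the task body reports whether the creation it tracks succeeded or failed, so the two failure modes never collide.

```python
# Sketch: 200 for "I found the task", body for "the creation failed".

tasks = {12345: {"state": "failed", "reason": "out of kyber crystals"}}

def get_task(task_id):
    """Handle GET /queue/<id>. Returns (http_status, body)."""
    task = tasks.get(task_id)
    if task is None:
        return 404, None       # failed to retrieve the task itself
    return 200, task           # retrieval succeeded...

status, body = get_task(12345)
assert status == 200           # ...even though the creation it tracks failed
assert body["state"] == "failed"
```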

------
aaaaaaaaaab
“Once the originally desired resource is created, there are two alternative
ways to deal with the temporary task resource:

API client must issue DELETE request, so that server purges it. Until then,
server responds with 303 See Other status. Once deleted, 404 Not Found will
be returned for subsequent GET /queue/12345 requests.

or, garbage collection can be a server’s job to do: once task is complete
server can safely remove it and respond with the 410 Gone on subsequent GET
/queue/12345 requests.”

How do you distinguish between an id referring to a finished task that no
longer exists (410 Gone), and an id that never referred to any task (404 Not
Found)? Other than the obvious solution of keeping every task in the DB with a
“finished” flag, which doesn’t scale...

~~~
naasking
> Other than the obvious solution of keeping every task in the DB with a
> “finished” flag, which doesn’t scale...

Why doesn't that scale?

~~~
aaaaaaaaaab
Because your storage requirements would grow indefinitely as you would need to
store a record for every task ever performed.

~~~
DougWebb
If your IDs are completely unique, you could add them to a Bloom filter [0]
when cleaning up the resource. Then you could use that to determine if an
incoming ID has possibly been used before or definitely not.

[0]
[https://en.wikipedia.org/wiki/Bloom_filter](https://en.wikipedia.org/wiki/Bloom_filter)
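
A minimal Bloom filter for this use can be sketched as follows (sizes and hash count are arbitrary; false positives are possible, no false negatives): finished task IDs are added on cleanup, and a membership query then answers "possibly seen" (serve 410) or "definitely never seen" (serve 404).

```python
import hashlib

# Sketch: a tiny Bloom filter over cleaned-up task ids.

SIZE = 1024                       # bits in the filter (arbitrary)
bits = bytearray(SIZE // 8)

def _positions(item, k=4):
    # Derive k bit positions from k salted hashes of the item.
    for i in range(k):
        h = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(h[:4], "big") % SIZE

def add(item):
    for p in _positions(item):
        bits[p // 8] |= 1 << (p % 8)

def possibly_seen(item):
    # True -> "possibly in set" (410 Gone); False -> "definitely not" (404).
    return all(bits[p // 8] & (1 << (p % 8)) for p in _positions(item))

add("task-12345")                           # cleanup of a finished task
assert possibly_seen("task-12345") is True  # no false negatives, ever
# An unknown id will almost always report unseen, but false positives exist.
```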

~~~
striking
Bloom filters are probabilistic, meaning they don't always produce the right
result. From the article you posted:

> False positive matches are possible, but false negatives are not – in other
> words, a query returns either "possibly in set" or "definitely not in set".
> Elements can be added to the set, but not removed (though this can be
> addressed with a "counting" filter); the more elements that are added to the
> set, the larger the probability of false positives.

So I don't think they're the right choice for this problem, if you actually
depend on the result "not here" vs "never was here" being right.

