
On incomplete HTTP reads and the requests library in Python - s3rvac
https://blog.petrzemek.net/2018/04/22/on-incomplete-http-reads-and-the-requests-library-in-python/
======
dagenix
This is one of the most frustrating issues I've had with the requests/urllib3
libraries. I have no idea what I'm supposed to do with half of an HTML page or
a PDF or a PNG. And even if I did, I certainly don't know how I'm supposed to
handle those things when I don't get an indication that I'm likely dealing
with an incomplete response. I can construct a use case where I might want
this behavior (maybe a FLIF image), but it feels like such a huge outlier to
me that this being the default behavior of requests, which caters to making
the common cases easy, has always felt a bit odd.
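
For concreteness, the check being asked for here is tiny. A minimal sketch (the function name is mine, not part of requests' API):

```python
def is_complete(headers, body):
    """Hypothetical check: does the body match the declared Content-Length?

    requests does not raise on this mismatch by default, which is the
    behavior being complained about above.
    """
    declared = headers.get("Content-Length")
    if declared is None:
        return True  # nothing declared (e.g. chunked), nothing to compare
    return len(body) == int(declared)
```

With a requests response this would be `is_complete(r.headers, r.content)`, with the caveat that content decoding can get in the way: a gzip-encoded response declares the compressed length, while `r.content` holds the decompressed bytes.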

I don't begrudge the authors for this at all. After all, these are people
donating their time to work on libraries solving hard problems for free that I
can then use as part of my job. There is nothing that has stopped me from
forking these libraries and changing the behavior myself, other than a lack of
time and the fact that, while I dislike this behavior, it's only a problem
somewhat infrequently. All that being said, I really wish the response to this
issue had been something other than what feels like a language-lawyerly reading
of the spec (requests isn't an "agent" and so that sentence in the spec doesn't
apply) and the theory that someone _might_ be able to do something with an
incomplete response _some_ of the time and as such, the vastly more common
case should be made much more complicated.

But, anyway, I'm glad to hear that this issue is being addressed and I thank
the authors for their work!

~~~
greglindahl
Even after this is fixed the way you want it, webservers will still send you
half of a PNG because it's a truncated file on the webserver. The webserver
will give the truncated length as the length, the actual transfer will be the
truncated length, requests will think all is well, and you'll have a problem.

~~~
dagenix
We can't fix every problem. And fixing every problem shouldn't be a gating
function for fixing some problems. Web servers generally want to send a
complete response. And sometimes some middle box chops off the response
halfway through. And if that origin server set a correct Content-Length header
and the response isn't that long, we can be fairly certain that the response
is incomplete. And I really can't come up with a non-contrived use case where
we want those incomplete responses to be delivered silently to client code.

It seems like what you are saying is that because a file might be corrupt we
shouldn't worry about a completely unrelated case where the file is fine but
the transfer is incomplete. By that logic, why do we worry about trying to
report errors at all?

~~~
greglindahl
Er, that's not what I'm saying.

I'm saying that if you get a response from a webserver, you have to think
about it being truncated. Period. No matter what the libraries you're using do
for the many different cases of problem that can cause truncation.

You seem to have read a lot into my words that I didn't intend to put there.

~~~
dagenix
I fully agree that there can always be garbage coming back and that any robust
client needs to be aware of that happening. I'm sorry if I read too much into
what you were saying, but, it sounded like you were arguing that the change
shouldn't be made because there are other ways that corruption could occur
that this change wouldn't solve. But, it sounds like you were just warning
that this change wouldn't be a panacea? Is that accurate?

~~~
greglindahl
I didn't want to express any opinion on the change; I use aiohttp for my
python-powered web-scale crawling needs. So yes, it's accurate that I was
warning that the change would be incremental and not a panacea.

BTW aiohttp has the problem of currently having a too-strict http protocol
parser, so it throws errors for many rare cases of bad webservers, which
browsers don't have a problem with. As a crawler, I need to be able to work
with whatever a browser will display, ...

~~~
dagenix
Well, it sounds like we are in fierce agreement then. Sorry for
misunderstanding your comment earlier.

I've been interested lately in exploring aiohttp, and your comment about
it being too strict is certainly very enlightening. And I want to second your
comment about libraries following the lead of what browsers will do. My strong
feeling is that the battle over what HTTP libraries should do has been fought
and been decided by browsers, and that whatever browsers do is what HTTP
libraries should do as well, regardless of how ugly it is (IIRC, when I last
checked, browsers did pay attention to the Content-Length header and wouldn't
display results that were shorter than it - but if I'm misremembering, I would
happily change my position with respect to honoring this header). The purist
in me hates to say that, but, the pragmatist wants to get things done and
fighting against browser behavior feels counter-productive at this stage.

------
cryptonector
I've written an HTTP service that offers "tail -f" as a service. When you GET
a resource with a Range: bytes=0- (or some other starting offset with
unspecified end offset) you get chunked transfer-encoding that doesn't
terminate until the file is unlinked or renamed out of the way.

This is incredibly handy, both for the usual things one might want to tail -f
(log files), and as a cheap-but-very-functional pub/sub system.

One thing I've learned is to watch out for silly proxies (client or reverse)
that want to read the whole thing and then re-serve it as definite-length
(i.e., not chunked, with Content-Length) or which impose a timeout on the
transfer.

HTTP needs a way to say that a chunked encoding is of indefinite length,
though arguably the Range: header in my case ought to be all the hint the
proxies (and libraries!) need.
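
A client for such an endpoint is short. A sketch using requests (the endpoint behavior is as described above; the function names are mine):

```python
def open_ended_range(start=0):
    # "bytes=N-": a starting offset with no end offset, as described above.
    return {"Range": "bytes=%d-" % start}

def tail(url, start=0):
    import requests  # assumed available; any streaming HTTP client would do
    # stream=True keeps requests from trying to buffer the indefinite,
    # chunked response; iter_lines yields lines as the server emits them.
    with requests.get(url, headers=open_ended_range(start), stream=True) as resp:
        for line in resp.iter_lines():
            yield line
```

The same silly proxies apply here too: anything in the middle that buffers the full response before re-serving it will stall this loop forever.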

~~~
batbomb
Isn’t the chunk the unit? Why should it be unbounded? You’d have issues on
the client side without knowing how long a chunk is going to be - it seems
like the chunk should be the unit, which can be guaranteed to fit in memory on
the client.

~~~
cryptonector
Chunked transfer encoding means that the sender sends <length>\r\n<chunk-of-
that-length>\r\n... but each chunk can be of different length. A zero-length
chunk terminates the transfer.

Chunked transfer encoding is used when the sender doesn't yet know the length
of the resource's representation.
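
On the wire that framing looks like this; a minimal encoder sketch:

```python
def encode_chunk(data):
    # One chunk of HTTP/1.1 chunked transfer encoding: the length in hex,
    # CRLF, the bytes themselves, CRLF. A zero-length chunk ends the stream.
    return b"%x\r\n" % len(data) + data + b"\r\n"

encode_chunk(b"hello")  # b'5\r\nhello\r\n'
encode_chunk(b"")       # b'0\r\n\r\n' -- the terminator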

------
dfox
As for the compression and resulting discrepancy of length: if you are doing
transparent HTTP compression, you are supposed to send Transfer-Encoding: gzip
(which implies chunked mode and that Content-Length shoul not be present and
should be ignored by UA). Content-Encoding: gzip means that the resource
itself is gzip compressed (which means that the UA should decompress it for
displaying to user, but should save it to disk compressed, IE and Safari gets
this wrong)

~~~
tonyarkles
Ahhhhh... Thanks! I've always wondered why occasionally .tar.gz downloads end
up ungzipped on disk (as just a .tar). Guess it was when using IE or Safari.

------
crdoconnor
>Secondly, if we throw an exception we irrevocably destroy the data we read.
It becomes impossible to access. This means that situations where the user
might want to ‘muddle through’, taking as much of the data as they were able
to read and keeping hold of it, becomes a little bit harder.

This is an awful justification. They should do this by raising an exception
and attaching the response data to it, so the user who wants to "muddle
through" can catch the exception and "muddle through" explicitly.

Regardless, this is a clear violation of [https://en.wikipedia.org/wiki/Fail-
fast](https://en.wikipedia.org/wiki/Fail-fast) and disappointing behavior from
a package that prides itself on clear, obvious API behavior.

~~~
cessor
I fully agree with you.

Altough I really like requests, this is a violation of one of the PEP 20
heuristics:

Errors should never pass silently. Unless explicitly silenced.

[https://www.python.org/dev/peps/pep-0020/](https://www.python.org/dev/peps/pep-0020/)

An exception would be an elegant way to handle the problem and would be able
to retain the incomplete data to be handeled. It's exactly what they are for.
The exception could be the default to prevent surprises, possible deactivated
with a flag.

------
llccbb
I am confused about the test used here. The `requests` docs very plainly state
that `Response.ok` only checks the status code. Looking into the codebase
proves that as well. Is there a status code for "I am going to send bad-length
content"?

This might be a UX problem and not a technical one. The author seems to think
that `Response.ok` should be an all-purpose check.

~~~
Twirrim
Honestly? I'm with the author here. "Response.ok" should tell me that the
response is ok. It's the most natural expectation. If it's only going to check
status code, Response.status_code.ok would be more reasonable.

~~~
ubernostrum
_It 's the most natural expectation. If it's only going to check status code,
Response.status_code.ok would be more reasonable._

Except that the status line is literally "HTTP 200 OK", and the 'ok' check is
simply using the domain terminology. This is not a bad thing. And having
'request.ok' be a general "I have examined every possible aspect of this
response for problems and found none" is probably impossible, despite being
what you apparently expect.

The solution is not to rename the method, it's to also warn or error when
there's something fishy in the response. Which is apparently what requests 3.x
will do.

There's also the issue that the Content-Length header, while generally well-
adopted, is optional (sending either Content-Length or Transfer-Encoding is a
SHOULD in RFC 7230, not a MUST), and sending an incorrect Content-Length,
while annoying, is not actually against the RFC (the only MUST NOT
prohibitions on an incorrect Content-Length value in the response are when the
request was a HEAD or a conditional GET).

~~~
TheSwordsman
I'm not sure I agree with your assessment that sending an incorrect Content-
Length is permitted. This specific post quotes a section in RFC-2616 that
deals with this directly[1]:

    
    
      When a Content-Length is given in a message where a message-body is allowed, its field value MUST exactly
      match the number of OCTETs in the message-body.
    

According to this, the Content-Length MUST match the data. Can you share an
example of the opposite being stated?

[1]
[https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4....](https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.4)

~~~
ubernostrum
RFC2616 is obsoleted by RFC7230. Relevant section:

[https://tools.ietf.org/html/rfc7230#section-3.3.2](https://tools.ietf.org/html/rfc7230#section-3.3.2)

Occurrences of MUST NOT for the server in that section are:

 _A sender MUST NOT send a Content-Length header field in any message that
contains a Transfer-Encoding header field._

(so can't send both Content-Length and Transfer-Encoding)

 _A server MAY send a Content-Length header field in a response to a HEAD
request (Section 4.3.2 of [RFC7231]); a server MUST NOT send Content-Length in
such a response unless its field-value equals the decimal number of octets
that would have been sent in the payload body of a response if the same
request had used the GET method._

(if you send Content-Length on HEAD, it must match the length of the response
you would have sent for GET)

 _A server MAY send a Content-Length header field in a 304 (Not Modified)
response to a conditional GET request (Section 4.1 of [RFC7232]); a server
MUST NOT send Content-Length in such a response unless its field-value equals
the decimal number of octets that would have been sent in the payload body of
a 200 (OK) response to the same request._

(you can send Content-Length on conditional GET; if you do, must match length
of the response you would have sent for GET)

 _A server MUST NOT send a Content-Length header field in any response with a
status code of 1xx (Informational) or 204 (No Content). A server MUST NOT send
a Content-Length header field in any 2xx (Successful) response to a CONNECT
request (Section 4.3.6 of [RFC7231])._

(some other situations that can't use Content-Length)

It may be an oversight, but nothing in that section requires that the Content-
Length, in general, must match the size of the response body.

~~~
TheSwordsman
I'd say the HEAD section is the best implication about the behavior they
expect. Where in this section there is precedent set for the spec requiring it
to be equal.

I'll agree and admit that implication is not explicit specification, but I
think it's a reasonable example to follow.

