Hacker News new | past | comments | ask | show | jobs | submit login
On incomplete HTTP reads and the requests library in Python (petrzemek.net)
92 points by s3rvac on April 22, 2018 | hide | past | favorite | 31 comments

This is one of the most frustrating issues I've had with the requests/urllib3 libraries. I have no idea what I'm supposed to do with half of an HTML page or a PDF or a PNG. And even if it did, I certainly don't know how I'm supposed to handle those things when I don't get an indication that I'm likely dealing with an incomplete response. I can construct a use case where I might want this behavior (maybe a FLIF image), but it feels like such a huge outlier to me, that this being the default behavior of requests, which caters to making the common cases easy, has always felt a bit odd.

I don't begrudge the authors for this at all. After all, these are people donating their time to work on libraries solving hard problems for free that I can then use as part of my job. There is nothing that has stopped me from forking these libraries and changing the behavior myself, other than a lack of time and that, while I dislike this behavior, its only is a problem somewhat infrequently. All that being said, I really wish the response to this issue has been something other than what feels like a language-lawerly reading of the spec (requests isn't an "agent" and so that sentence in the spec doesn't apply) and the theory that someone _might_ be able to do something with an incomplete response _some_ of the time and as such, the vastly more common case should be made much more complicated.

But, anyway, I'm glad to hear that this issue is being addressed and I thank the authors for their work!

Even after this is fixed the way you want it, webservers will still send you half of a PNG because it's a truncated file on the webserver. The webserver will give the truncated length as the length, the actual transfer will be the truncated length, requests will think all is well, and you'll have a problem.

We can't fix every problem. And fixing every problem shouldn't be a gating function for fixing some problems. Web servers generally want to send a complete response. And sometimes some middle box chops off the response halfway through. And if that origin server set a correct Content-Length header and the response isn't that long, we can be fairly certain that the response is probably incomplete. And I really can't come up with a non-contrived use-case where we want those incomplete responses to be delivered silently to client code.

It seems like what you are saying is that because a file might be corrupt we shouldn't worry about a completely unrelated case where the file is fine but the transfer is incomplete. By that logic, why do we worry about trying to report errors at all?

Er, that's not what I'm saying.

I'm saying that if you get a response from a webserver, you have to think about it being truncated. Period. No matter what the libraries you're using do for the many different cases of problem that can cause truncation.

You seem to have read a lot into my words that I didn't intend to put there.

I fully agree that there can always be garbage coming back and that any robust client needs to be aware of that happening. I'm sorry if I read too much into what you were saying, but, it sounded like you were arguing that the change shouldn't be made because there are other ways that corruption could occur that this change wouldn't solve. But, it sounds like you were just warning that this change wouldn't be a panacea? Is that accurate?

I didn't want to express any opinion on the change; I use aiohttp for my python-powered web-scale crawling needs. So yes, it's accurate that I was warning that the change would be incremental and not a panacea.

BTW aiohttp has the problem of currently having a too-strict http protocol parser, so it throws errors for many rare cases of bad webservers, which browsers don't have a problem with. As a crawler, I need to be able to work with whatever a browser will display, ...

Well, it sounds like we are in fierce agreement then. Sorry for misunderstanding your comment earlier.

I've been interesting lately in exploring at aiohttp, and your comment about it being too strict is certainly very enlightening. And I want to second your comment about libraries following the lead of what browsers will do. My strong feeling is that the battle over what HTTP libraries should do has been fought and been decided by browsers, and that whatever browsers do is what HTTP libraries should do as well, regardless of how ugly it is (IIRC, when I last checked, browsers did pay attention to the Content-Length header and wouldn't display results that were shorter than it - but if I'm misremembering, I would happily change my position with respect to honoring this header). The purist in me hates to say that, but, the pragmatist wants to get things done and fighting against browser behavior feels counter-productive at this stage.

I've written an HTTP service that offers "tail -f" as a service. When you GET a resource with a Range: bytes=0- (or some other starting offset with unspecified end offset) you get chunked transfer-encoding that doesn't terminate until the file is unlinked or renamed out of the way.

This is incredibly handy, both for the usual things one might want to tail -f (log files), and as a cheap-but-very-functional pub/sub system.

One thing I've learned is to watch out for silly proxies (client or reverse) that want to read the whole thing and then re-serve it as definite-length (i.e., not chunked, with Content-Length) or which impose a timeout on the transfer.

HTTP needs a way to say that a chunked encoding is of indefinite length, though arguably the Range: header in my case ought to be all the hint the proxies (and libraries!) need.

Isn’t the chunk the unit, why should it be unbounded m? You’d have issues on the client side without knowing how long a chunk is going to be - It seems like the chunk should be the unit which can be guaranteed to fit in memory on the client.

Chunked transfer encoding means that the sender sends <length>\r\n<chunk-of-that-length>\r\n... but each chunk can be of different length. A zero-length chunk terminates the transfer.

Chunked transfer encoding is used when the sender doesn't yet know the length of the resource's representation.

You could consider using websockets, which are intended to be used the way you're trying to use chunking.

I'm aware. But this is easier to use. For example, here's how you tail some log file: curl -H 'Range: bytes=0-' https://foo.example/bar.log.

Is this OSS? Is it available for use anywhere?

Not yet. I've asked permission to open source it. It's basically epoll-based, evented, and it's darned simple.

As for the compression and resulting discrepancy of length: if you are doing transparent HTTP compression, you are supposed to send Transfer-Encoding: gzip (which implies chunked mode and that Content-Length shoul not be present and should be ignored by UA). Content-Encoding: gzip means that the resource itself is gzip compressed (which means that the UA should decompress it for displaying to user, but should save it to disk compressed, IE and Safari gets this wrong)

Ahhhhh... Thanks! I've always wondered why occasionally .tar.gz downloads end up ungzipped on disk (as just a .tar). Guess it was when using IE or Safari.

>Secondly, if we throw an exception we irrevocably destroy the data we read. It becomes impossible to access. This means that situations where the user might want to ‘muddle through’, taking as much of the data as they were able to read and keeping hold of it, becomes a little bit harder.

This is an awful justification. They should do this by raising an exception and attaching the response data to it, so the user who wants to "muddle through" can catch the exception and "muddle through" explicitly.

Regardless, this is a clear violation of https://en.wikipedia.org/wiki/Fail-fast and disappointing behavior from a package that prides itself on clear, obvious API behavior.

I fully agree with you.

Altough I really like requests, this is a violation of one of the PEP 20 heuristics:

Errors should never pass silently. Unless explicitly silenced.


An exception would be an elegant way to handle the problem and would be able to retain the incomplete data to be handeled. It's exactly what they are for. The exception could be the default to prevent surprises, possible deactivated with a flag.

I am confused about the test used here. The `requests` docs very plainly state that `Response.ok` only checks the status code. Looking into the codebase proves that as well. Is there a status code for "I am going to send bad-length content"?

This might be a UX problem and not a technical one. The author seems to think that `Response.ok` should be an all-purpose check.

Its completely a UX issue, but it seems reasonable to me. I agree itS not surprising behavior given the docs, but I believe it should be added.

Its been a while since i used Requests, but iirc response.ok is basically syntax sugar; but it seems to me that in most valid usecases where you'd like this sugar (over being explicit in your actions), is when you'd like to verify the communications was correct. And malformed http is not correct. I imagine if you implemented a wrapper ok2, of correctness check + response.ok, you'd see 90% of response.ok become ok2

It seems to me to be a sensible check (validate that the http message meets the standard), that should exist in any http library at request's level. And response.ok seems like a wasted api slot, if its not meeting the full needs of its sugaring

Honestly? I'm with the author here. "Response.ok" should tell me that the response is ok. It's the most natural expectation. If it's only going to check status code, Response.status_code.ok would be more reasonable.

It's the most natural expectation. If it's only going to check status code, Response.status_code.ok would be more reasonable.

Except that the status line is literally "HTTP 200 OK", and the 'ok' check is simply using the domain terminology. This is not a bad thing. And having 'request.ok' be a general "I have examined every possible aspect of this response for problems and found none" is probably impossible, despite being what you apparently expect.

The solution is not to rename the method, it's to also warn or error when there's something fishy in the response. Which is apparently what requests 3.x will do.

There's also the issue that the Content-Length header, while generally well-adopted, is optional (sending either Content-Length or Transfer-Encoding is a SHOULD in RFC 7230, not a MUST), and sending an incorrect Content-Length, while annoying, is not actually against the RFC (the only MUST NOT prohibitions on an incorrect Content-Length value in the response are when the request was a HEAD or a conditional GET).

I'm not sure I agree with your assessment that sending an incorrect Content-Length is permitted. This specific post quotes a section in RFC-2616 that deals with this directly[1]:

  When a Content-Length is given in a message where a message-body is allowed, its field value MUST exactly
  match the number of OCTETs in the message-body.
According to this, the Content-Length MUST match the data. Can you share an example of the opposite being stated?

[1] https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4....

RFC2616 is obsoleted by RFC7230. Relevant section:


Occurrences of MUST NOT for the server in that section are:

A sender MUST NOT send a Content-Length header field in any message that contains a Transfer-Encoding header field.

(so can't send both Content-Length and Transfer-Encoding)

A server MAY send a Content-Length header field in a response to a HEAD request (Section 4.3.2 of [RFC7231]); a server MUST NOT send Content-Length in such a response unless its field-value equals the decimal number of octets that would have been sent in the payload body of a response if the same request had used the GET method.

(if you send Content-Length on HEAD, it must match the length of the response you would have sent for GET)

A server MAY send a Content-Length header field in a 304 (Not Modified) response to a conditional GET request (Section 4.1 of [RFC7232]); a server MUST NOT send Content-Length in such a response unless its field-value equals the decimal number of octets that would have been sent in the payload body of a 200 (OK) response to the same request.

(you can send Content-Length on conditional GET; if you do, must match length of the response you would have sent for GET)

A server MUST NOT send a Content-Length header field in any response with a status code of 1xx (Informational) or 204 (No Content). A server MUST NOT send a Content-Length header field in any 2xx (Successful) response to a CONNECT request (Section 4.3.6 of [RFC7231]).

(some other situations that can't use Content-Length)

It may be an oversight, but nothing in that section requires that the Content-Length, in general, must match the size of the response body.

I'd say the HEAD section is the best implication about the behavior they expect. Where in this section there is precedent set for the spec requiring it to be equal.

I'll agree and admit that implication is not explicit specification, but I think it's a reasonable example to follow.

It's worth noting that browsers allow a lot of things that either look like errors to a human or are forbidden by the RFCs. Some people want defacto behavior, others want dejure.

Except that the status line is literally "HTTP 200 OK", and the 'ok' check is simply using the domain terminology.

Well, not exactly. From the documentation of .ok: "This is not a check to see if the response code is 200 OK."

Technically, it checks if status code is not between 400 and 600.

I agree with your point that there should be some baked in capability for a more comprehensive check on the HTTP response. There are probably some edge cases around this regarding stream=True (load header only, not content) and iter_content() which pulls chunks from a buffer so the entire content isn't loaded into memory at once. Probably surmountable.

FWIW Response.status_code is an int, so it might need to be a member function on Response.

Response.status_code is an int

No, it acts like an int; whether it is one (should be) irrelevant :)

    class StatusCode(int):
        def ok(self):
            return 200 <= self < 300

If a HTTP response is malformed, the most natural expectation for an exception to be thrown.

I'm glad to see the requests 3.0 does this by default for invalid Content-Length.

To be fair, basically the whole reason requests exists is to try and have good UX for devs.

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact