To conserve host resources RFC 2616 recommends making multiple HTTP requests ove...

1vuio0pswjnm7 · on Nov 24, 2021

Heres another way to do it without the subshell, using tr.

       #!/bin/sh
       IFS=/;while read w x y z;do
       v=$(echo x|tr x '\34');
       case $w in http:|https:);;*)exit;esac;
       case $x in "");;*)exit;esac;
       echo $y > .host
       printf '%s\r\n' "GET /$z HTTP/1.1";
       printf '%s\r\n' "Host: $y";
       printf 'Connection: keep-alive'$v$v;done \
       |sed '$s/keep-alive/close/'|tr '\34\34' '\r\n' > .http;
       read x < .host;
       # SNI;
       #openssl s_client -connect $x:443 -ign_eof -servername $x < .http;
       # no SNI;
       openssl s_client -connect $x:443 -ign_eof -noservername < .http;
       exec rm .host .http;

1vuio0pswjnm7 · on Nov 24, 2021

   case $x in "");;*)exit;esac

is better written as

   test ${#x} = 0||exit

themk · on Nov 24, 2021

The curl binary will reuse the TCP connection when fed multiple URLs. Infact it can even use HTTP2 and make the requests in parallel over a single TCP connection. Common pattern I use is to construct URLs with a script and use xargs to feed to curl.

1vuio0pswjnm7 · on Nov 24, 2021

For HTTP/1.1 pipelining, not HTTP/2 which not all websites support, the curl binary must be slower because the program tries to do more than just make HTTP from URLs and send the text over TCP. It tries to be "smart" and that slows it down. But dont't take my word for it, test it.

For example, compare the retrieval speed of the above to something like

     sed -n '/^http/s/^/url=/p' URLs.txt|curl -K- --http1.1

themk · on Nov 24, 2021

I was responding to your original comment, which has since been edited:

> When fed mutiple URLs, the curl binary will open multiple TCP connections, consuming more resources on the host.

Which I felt was a bit of an unfair thing to say.

I have no issue with the rest :)

1vuio0pswjnm7 · on Nov 24, 2021

Although historically it was true, I agree it was an unfair statement thus I removed it. Thank you for the correction. I will be testing curl a bit more; I rarely ever use it. I could never trust cURL for doing fast HTTP/1.1 pipelining so I am admittedly skeptical. It is a moving target that is constantly changing. The author clearly has a bias toward HTTP/2, and never really focused on HTTP/1.1 pipelining support even though HTTP/1.1 has, IME, worked really well for bulk text retrieval from a single host using a wide variety of smaller simpler programs,^1 not a graphical web browser, for at least fifteen years.^2 HTTP/2 is designed for graphical web pages full of advertising and tracking; introduced as a "standard" by a trillion dollar web advertising company who also controls the majority share web browser. Thankfully, HTTP/1.1 does not have that baggage.

1. Original netcat, djb's tcpclient, socat, etc.

2. It could be used to retrieve lots of small binary files too. See phttpget.

saltcured · on Nov 25, 2021

I'm not sure if you realize it but connection reuse and pipelining are not the same thing. Curl does connection reuse with HTTP/1.1 but not pipelining. Connection reuse is what was described up-thread, using a single connection to convey many requests and responses. Pipelining is a step further where the client pushes multiple requests down the connection before receiving responses.

Pipelining is problematic with HTTP/1.1 because servers are allowed to close a connection to signal the end of a response, rather than announcing the Content-Length in the response header. The 1.1 protocol also requires that responses to pipelined requests come in the same order as the original requests. This is awkward and most servers do not bother to parallelize request processing and response writing. Even if the client sends requests back-to-back, the responses will come with delays between them.

With HTTP/1.1 curl (like most clients) will do non-pipelined connection reuse. They push one request, read one response, then push the next request, etc. The network traffic will be small bursts with gaps between them, as each request-response cycle takes at least a round-trip delay. This is still faster than using independent connections, where extra TCP connection setup happens per request, and this is even more significant for HTTPS.

1vuio0pswjnm7 · on Nov 25, 2021

I do realise it. I have little interest in curl. It is far too complicated with too many features, IMO. Anyway, libcurl used to support "attempted pipelining". See https://web.archive.org/web/20101015021914if_/http://curl.ha... But the number of people writing programs using CURLMOPT_PIPELINING always seemed small. Thus, I started using netcat instead. No need for libcurl. No need for the complexity.

HTTP/1.1 pipelining was nor is ever "problematic" for me. Thus I cannot relate to statements that try to suggest it is "problematic", especially without providing a single example website.

I am not looking at graphical webpages. I am not trying pull resources from multiple hosts. I am retrieving text from a single host, a continuous stream of text. I do not want parallel processing. I do not want asynchronous. I want synchronous. I want responses in the same order as requests. This allows me to use simple methods for verifying responses and processing them. HTTP/2 is far more complicated. It also has bad ideas like "server push" which is absolutely not what I am looking for.

There are some sites that disable HTTP/1.1 pipelining; they will send a Connection: close header. These are not the majority. There are also some sites where there is a noticeable delay before the first reponse or between responses when using HTTP/1.1 pipelining. That is also a small minority of sites. Most have no noticeable delay. Most are "blazingly fast". "HOB blocking" is not important to me if it is so small that I cannot notice it. If HTTP/1.1 pipelining is "blazingly fast" 98% of the time for me, I am going to use it where I can, which, as it happens, is almost everywhere.

Even with the worst possible delay I have ever experienced, HTTP/1.1 pipelining is still faster than curl/wget/lftp/etc. That is in practice, not theory. People who profess expertise in www technical details today struggle to even agree on what "pipelining" means. For example, see https://en.wikipedia.org/wiki/Talk:HTTP_Pipelining. Trying to explain that socket reuse differs from HTTP/1.1 pipelining is not worth the effort and I am not the one qualified to do it. But I am qualified to state that for text retrieval from single host, HTTP/1.1 pipelining works on most websites and is faster than curl. Is it slower than nghttp2. If yes, how much slower. We know the theory. We also know we cannot believe everything we read. Test it.

1vuio0pswjnm7 · on Nov 25, 2021

One more thing: cURL only supported pipelining GET and HEAD. And on www forums where people try to sound like experts, I have read assertions that POST requests cannot be pipelined. Makes sense in theory, but I know I tried pipelining POST requests before and, surprisingly, it worked on at least one site. Using curl's limited "attempted pipelining", one could never discover this, which is another example of how programs with dozens of features can stil be very inflexible. If anyone doubts this is true, I can try to remember the site that answered pipelined POST requests.

saltcured · on Nov 25, 2021

I think the limitation on methods is related to error handling and reporting, and ambiguity as to whether the "extra" requests on the wire have been processed or not. It's the developers of the tools, who read the specifications, who find the topic "problematic." For whatever reason, there is a mildly paternalistic culture among web plumbing developers. More than in some other disciplines, they seem to worry more about offering unsafe tools, and focusing on easy to use "happy paths" rather than offering flexibility which requires very careful use.

Going back to pipelining, the textbook definition is all about concurrent use of every resource along the execution path in order to hit maximum throughput. As if the "pipe" from client application to server and back to client is always full of content/work from the start of the first input in the stream until the end of the last output. That was rarely achieved with HTTP/1.1 because of the way most of the middleware and servers were designed. Even if you could managed to pipeline your request inputs to keep the socket full from client to server, the server usually did not pipeline its processing and responses. Instead, the server alternated between bursts of work to process a request and idle wait periods while responses were sent back to the client. How much this matters in practice depends on the relative throughput and latency measures of all the various parts in your system.

I measured this myself in the past, using libcurl's partial pipelining with regular servers like Apache. I could get much faster upload speeds with pipelined PUT requests, really hitting the full bandwidth for sending back-to-back message payloads that kept the TCP path full. But, pipelined GET requests did not produce pipelined GET responses, so the download rate was always a lower throughput with measurable spikes and delays as the socket idled briefly between each response payload. For our high bandwidth, high latency environment, the actual path could be measured as having symmetric capacity for TCP/TLS. The pipelined uploads got within a few percent of that, while the non-pipelined downloads had an almost 50% loss in throughput.

If I were in your position and continued to care about streaming requests and responses from a scripting environment, I might consider writing my own client-side tool. Something like curl to bridge between scripts and network, using an HTTP/2 client library with an async programming model hooked up to CLI/stdio/file handling conventions that suit my recurring usage. However, I have found that the rest of the client side becomes just as important for performance and error handling if I am trying to process large numbers of URLs/files. So, I would probably stop thinking of it as a conventional scripting task and instead think of it more like a custom application. I might write the whole thing in Python and worry about async error handling, state tracking, and restart/recovery to identify the work items, handle retries as appropriate, and be able to confidently tell when I have finished a whole set of work...

1vuio0pswjnm7 · on Nov 27, 2021

s/HOB/HOL/