Hacker News

Internet Archive has an HTTP header called "X-Archive-Wayback-Perf:"

I can guess what it means but maybe someone here has some insight?

It certainly looks like their Tengine (an nginx fork) servers are configured to expect pipelined requests. They have no problem with more than 100 requests at a time. See the HTTP header above.

Downloading each snapshot over its own connection, one after the other, with each closed connection perhaps lingering in TIME_WAIT and consuming resources, may not be the most sensible or considerate approach. If you are just requesting the history of a single URL, maybe pipelined requests over a single connection are more efficient? I'm biased and I could be wrong.
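
For anyone who wants to try the single-connection idea, here is a rough sketch in Python of what pipelining looks like on the wire. It is not my downloader and not anything IA documents. The host name, the port and the example snapshot paths are assumptions for illustration, and whether the server answers every request in a batch is something you would have to check yourself.

```python
import socket
import ssl


def build_pipeline(host, paths):
    """Concatenate one GET request per path; only the last one asks the server to close."""
    requests = []
    for i, path in enumerate(paths):
        last = i == len(paths) - 1
        requests.append(
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Connection: {'close' if last else 'keep-alive'}\r\n"
            "\r\n"
        )
    return "".join(requests).encode("ascii")


def fetch_pipelined(host, paths, port=443, timeout=30):
    """Send every request over one TLS connection, then read until the server closes it."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            # All requests go out back to back before any response is read.
            tls.sendall(build_pipeline(host, paths))
            chunks = []
            while True:
                data = tls.recv(65536)
                if not data:
                    break
                chunks.append(data)
    # Raw bytes: the responses arrive in request order and still need to be split.
    return b"".join(chunks)


if __name__ == "__main__":
    # Hypothetical snapshot paths, for illustration only.
    example_paths = [
        "/web/2015/http://example.com/",
        "/web/2020/http://example.com/",
    ]
    print(build_pipeline("web.archive.org", example_paths).decode("ascii"))
```

Run as a script it only prints the request batch and does not touch the network. fetch_pipelined is the part that opens the one connection, and for large batches you would probably want to read responses while still sending rather than sending everything first.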

However, their robots.txt says "Please crawl our files." I would guess that crawlers use pipelining and minimize the number of open connections.

I have had my own "wayback downloader" for a number of years, written as a shell script using openssl and sed. It's fast.

IA is one of the best sites on the www. Have fun.


