- Applebot was originally written in Go (and uncovered a user agent bug on redirects, revealing it's Go origins to the world, which Russ Cox fixed the next day).
- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.
- In it's first week of existence, it nearly took Apple's internal DNS servers offline. It was then modified to do it's own DNS resolution and caching, fond memories...
Source: I worked on the original version.
Unlike other languages, Go bypasses the system's DNS cache and goes directly to the DNS server, which is a root cause of many problems.
Since netgo is faster, by default Go will try its best to determine if it must use netcgo by parsing /etc/nsswitch.conf, looking at the TLD, reading env variables, etc.
If you're building the code you can force it to use netcgo by adding the netcgo build tag.
If you're an administrator, I think the least intrusive method would be setting LOCALDOMAIN to something (or '' if you can't think of anything), which will force it to use NSS.
If you're on a system with cgo available, you can use `GODEBUG=netdns=cgo` to avoid making direct DNS requests.
This is the default on macOS, so if it was running on four Mac Pros I wouldn't expect it to be the root cause.
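For anyone who would rather pick the resolver in code than via build tags or `GODEBUG`, here is a minimal sketch using the standard library's `net.Resolver`. Its `PreferGo` field is documented as equivalent to `GODEBUG=netdns=go`, scoped to that one resolver. The `newResolver` helper and the `example.com` hostname are just placeholders for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newResolver is a hypothetical helper. With pureGo=true the resolver
// prefers Go's built-in DNS client (on platforms where it is available);
// with pureGo=false it behaves like a zero-value net.Resolver.
func newResolver(pureGo bool) *net.Resolver {
	return &net.Resolver{PreferGo: pureGo}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	r := newResolver(true)
	addrs, err := r.LookupHost(ctx, "example.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}
```

Going the other way (forcing cgo) is still easiest with `GODEBUG=netdns=cgo` or the `netcgo` build tag, as described above.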
To make DNS resolution really scale, we ended up moving all the DNS caching and resolution directly into Go. Not sure that's how you'd do it today; I'm sure Go has changed a lot. Building your own DNS resolver is actually not so hard with Go; the following were really useful:
As I understand it, Go and Java are both trying to avoid FFI and calling out to system libs for name resolution.
I tend to always offer a local caching resolver available over a socket.
Considering the timeline, were those the Trash Can Mac Pros? Or the old Cheese Graters?
The scale of web stuff sometimes surprises me. 1B web pages sounds like just about the daily web output of humanity? How can you handle this with 4 (fast) computers?
The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method. For example, if a URL might produce a large download, a HEAD request could read its Content-Length header to check the filesize without actually downloading the file.
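A small sketch of that HEAD check in Go, in case it helps. `headSize` and `newDemoServer` are hypothetical names, and the demo runs against a local `httptest` server so it needs no network access. For HEAD responses, `resp.ContentLength` is taken from the `Content-Length` header (it is -1 when the server does not send one).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// headSize issues a HEAD request and reports the advertised Content-Length
// without downloading the body. It returns -1 if the length is unknown.
func headSize(client *http.Client, url string) (int64, error) {
	resp, err := client.Head(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return resp.ContentLength, nil
}

// newDemoServer is a local stand-in for a server with a large download.
func newDemoServer(payload []byte) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Length", strconv.Itoa(len(payload)))
		if r.Method == http.MethodHead {
			return // headers only, no body
		}
		w.Write(payload)
	}))
}

func main() {
	srv := newDemoServer(make([]byte, 1<<20))
	defer srv.Close()

	size, err := headSize(srv.Client(), srv.URL)
	if err != nil {
		fmt.Println("HEAD failed:", err)
		return
	}
	fmt.Println("advertised size in bytes:", size)
}
```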
When you request a page, the response includes metadata, usually including a Last-Modified timestamp and often including an ETag (entity tag). Then when you make subsequent requests, you can include these in If-Modified-Since and If-None-Match request headers.
If the resource hasn’t changed, then the server responds with 304 Not Modified instead of sending the resource all over again. If the resource has changed, then the server sends it straight away.
Doing it this way means that in the case where the resource has changed, you make one request instead of two, and it also avoids a race condition where the resource changes between the HEAD and the GET requests.
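Here is a minimal sketch of that conditional GET flow in Go. The `validators` type, `conditionalGet`, and `newDemoServer` are illustrative names, not from any library, and the demo server just hard-codes one ETag so the 304 path can be seen locally.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// validators holds the cache validators remembered from a previous response.
type validators struct {
	etag         string
	lastModified string
}

// conditionalGet sends a single GET with If-None-Match / If-Modified-Since.
// On 304 it returns changed=false and no body; on 200 it returns the new
// body together with the validators to remember for next time.
func conditionalGet(client *http.Client, url string, v validators) ([]byte, validators, bool, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, v, false, err
	}
	if v.etag != "" {
		req.Header.Set("If-None-Match", v.etag)
	}
	if v.lastModified != "" {
		req.Header.Set("If-Modified-Since", v.lastModified)
	}

	resp, err := client.Do(req)
	if err != nil {
		return nil, v, false, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		return nil, v, false, nil // unchanged: keep the old copy
	}
	if resp.StatusCode != http.StatusOK {
		return nil, v, false, fmt.Errorf("unexpected status: %s", resp.Status)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, v, false, err
	}
	next := validators{
		etag:         resp.Header.Get("ETag"),
		lastModified: resp.Header.Get("Last-Modified"),
	}
	return body, next, true, nil
}

// newDemoServer is a local stand-in that serves one resource with a fixed ETag.
func newDemoServer() *httptest.Server {
	const etag = `"v1"`
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("ETag", etag)
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		fmt.Fprint(w, "hello")
	}))
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	body, v, changed, err := conditionalGet(srv.Client(), srv.URL, validators{})
	fmt.Println(string(body), changed, err)

	// The second fetch sends the remembered ETag back to the server.
	body, _, changed, err = conditionalGet(srv.Client(), srv.URL, v)
	fmt.Println(len(body), changed, err)
}
```

A crawler would store the validators next to each page and send them on the next visit, so unchanged pages cost one small round trip.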
Does it use a traditional relational database or another existing database-like product? Or is it built from scratch, just sitting on top of a file system?
I'm also interested, at the higher end of the scale, in how 'hard' or polite you should be with authoritative nameservers, since some of them rate limit too.
Apple's corporate network also had incredible bandwidth to the Internet at large. Not sure why, but I assumed it was because their earliest data centers actually ran in office buildings in the vicinity of 1 Infinite Loop.
Apple's server stack has been primarily Java for about 20 years.
Great use case for Go though, especially its concurrency features for web crawlers. I reckon Scala could work too, although it's a lot more complicated / clever.
The book is freely available online at https://nlp.stanford.edu/IR-book/information-retrieval-book....
I guess there are not many places where you can easily get 4GB/s sustained throughput from a single office (especially with proxy servers and firewalls in front of it). Is that standard at Apple or did the infrastructure team get involved to provide that kind of bandwidth?
All three uses of "it's" should be "its".
And I would just write "Mac Pros" instead of "Mac Pro's".