- Applebot was originally written in Go (and uncovered a user agent bug on redirects, revealing its Go origins to the world, which Russ Cox fixed the next day).
- Up until the release of iOS 9, Applebot ran entirely on four Mac Pros in an office. Those four Mac Pros could crawl close to 1B web pages a day.
- In its first week of existence, it nearly took Apple's internal DNS servers offline. It was then modified to do its own DNS resolution and caching, fond memories...
Source: I worked on the original version.
This is true but a little misleading. On Windows, Go uses GetAddrInfo and DNSQuery, which do the right thing. But on Linux there are two options, netgo and netcgo -- a pure Go implementation that doesn't know about NSS, and a C wrapper that uses NSS.
Since netgo is faster, by default Go will try its best to determine whether it must use netcgo by parsing /etc/nsswitch.conf, looking at the TLD, reading env variables, etc.
If you're building the code, you can force it to use netcgo by adding the netcgo build tag.
If you're an administrator, the least intrusive method, I think, would be setting LOCALDOMAIN to something (or to '' if you can't think of anything), which will force it to use NSS.
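For reference, the knobs involved, as I understand them (a sketch; the ./crawler binary name is just a placeholder, and the exact behavior is worth double-checking against your Go version's net package docs):

    package main

    import (
        "context"
        "fmt"
        "net"
    )

    func main() {
        // Build-time, for the whole binary:
        //   go build -tags netcgo ./...   // always use the cgo (NSS-aware) resolver
        //   go build -tags netgo ./...    // always use the pure Go resolver
        // Run-time, without rebuilding:
        //   GODEBUG=netdns=cgo ./crawler
        //   GODEBUG=netdns=go ./crawler

        // Per-resolver, in code: prefer the pure Go resolver for these lookups only.
        r := &net.Resolver{PreferGo: true}
        addrs, err := r.LookupHost(context.Background(), "example.com")
        fmt.Println(addrs, err)
    }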
It's possible that wasn't the default setting on Macs back then. I don't know that cgo would be a good choice either if you're resolving a ton of domains at once. Early versions of Go would create new threads if a goroutine made a cgo call and an existing thread wasn't available. I remember this required us to throttle concurrent dial calls; otherwise we'd end up with thousands of threads and eventually bring the crawler to a halt.
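A minimal sketch of that kind of throttle, using a buffered channel as a semaphore around the dialer (the limit of 256 and the names here are made up for illustration):

    package main

    import (
        "context"
        "net"
        "net/http"
    )

    // dialSem caps how many dials (and therefore DNS lookups) run at once,
    // so cgo lookups can't fan out into thousands of OS threads.
    var dialSem = make(chan struct{}, 256) // 256 is an arbitrary example limit

    func throttledDial(ctx context.Context, network, addr string) (net.Conn, error) {
        dialSem <- struct{}{}        // acquire a slot; blocks when the limit is hit
        defer func() { <-dialSem }() // release the slot once the dial returns
        var d net.Dialer
        return d.DialContext(ctx, network, addr)
    }

    func main() {
        client := &http.Client{
            Transport: &http.Transport{DialContext: throttledDial},
        }
        _ = client // use client.Get(...) etc. in the crawler
    }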
To make DNS resolution really scale, we ended up moving all the DNS caching and resolution directly into Go. Not sure that's how you'd do it today; I'm sure Go has changed a lot. Building your own DNS resolver is actually not so hard with Go; the following were really useful:
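As a rough illustration (not the original references, and not Applebot's actual code), a per-process cache on top of Go's resolver might look something like this, assuming a single flat TTL and no negative caching:

    package dnscache

    import (
        "context"
        "net"
        "sync"
        "time"
    )

    type entry struct {
        addrs   []string
        expires time.Time
    }

    // Cache wraps Go's resolver with a simple in-process cache so repeated
    // lookups of the same host never leave the crawler.
    type Cache struct {
        mu  sync.Mutex
        res net.Resolver
        m   map[string]entry
        ttl time.Duration
    }

    func New(ttl time.Duration) *Cache {
        return &Cache{m: make(map[string]entry), ttl: ttl}
    }

    func (c *Cache) LookupHost(ctx context.Context, host string) ([]string, error) {
        c.mu.Lock()
        if e, ok := c.m[host]; ok && time.Now().Before(e.expires) {
            c.mu.Unlock()
            return e.addrs, nil // cache hit
        }
        c.mu.Unlock()

        // Cache miss: resolve with the built-in resolver. Concurrent misses
        // for the same host may resolve twice; fine for a sketch.
        addrs, err := c.res.LookupHost(ctx, host)
        if err != nil {
            return nil, err
        }
        c.mu.Lock()
        c.m[host] = entry{addrs: addrs, expires: time.Now().Add(c.ttl)}
        c.mu.Unlock()
        return addrs, nil
    }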
>Up until the release of iOS 9, Applebot ran entirely on four Mac Pros in an office. Those four Mac Pros could crawl close to 1B web pages a day.
The scale of web stuff sometimes surprises me. 1B web pages sounds like just about the daily web output of humanity? How can you handle this with 4 (fast) computers?
Yes, let's all run separate web browsers as the application and run our own JavaScript inside our browser. Who cares if there are 5 other "apps" doing exactly the same thing!
I think they were referring more to apps like Slack and other similar browser/JS-based apps which run separately from the browser. Maybe I'm being generous? Slack is certainly itself a beast.
Doesn't it depend on a lot of things? For example, you could do just HEAD requests to see if a page has changed since a given timestamp. If it hasn't, there's no need to process it.
The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method. For example, if a URL might produce a large download, a HEAD request could read its Content-Length header to check the filesize without actually downloading the file.
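In Go, for example, checking a size this way might look like the following (the URL is just a placeholder):

    package main

    import (
        "fmt"
        "net/http"
    )

    func main() {
        // HEAD returns only the headers; no body is downloaded.
        resp, err := http.Head("https://example.com/big-file.iso")
        if err != nil {
            panic(err)
        }
        resp.Body.Close()
        // ContentLength is -1 if the server didn't send a Content-Length header.
        fmt.Println("size in bytes:", resp.ContentLength)
    }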
Typically you wouldn’t bother with a HEAD request; you’d do a conditional GET request.
When you request a page, the response includes metadata, usually including a Last-Modified timestamp and often including an ETag (entity tag). Then when you make subsequent requests, you can include these in If-Modified-Since and If-None-Match request headers.
If the resource hasn’t changed, then the server responds with 304 Not Modified instead of sending the resource all over again. If the resource has changed, then the server sends it straight away.
Doing it this way means that in the case where the resource has changed, you make one request instead of two, and it also avoids a race condition where the resource changes between the HEAD and the GET requests.
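In Go, that might look roughly like this (the saved validator values and the function name are hypothetical; a real crawler would persist them per URL):

    package crawler

    import (
        "io"
        "net/http"
    )

    // fetchIfChanged does a conditional GET using validators saved from a
    // previous crawl of the same URL. It returns (nil, false, nil) when the
    // server reports the resource is unchanged.
    func fetchIfChanged(url, savedETag, savedLastModified string) ([]byte, bool, error) {
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, false, err
        }
        if savedETag != "" {
            req.Header.Set("If-None-Match", savedETag)
        }
        if savedLastModified != "" {
            req.Header.Set("If-Modified-Since", savedLastModified)
        }

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, false, err
        }
        defer resp.Body.Close()

        if resp.StatusCode == http.StatusNotModified {
            return nil, false, nil // 304: unchanged since the last crawl
        }
        body, err := io.ReadAll(resp.Body)
        return body, true, err
    }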
Does it use a traditional relational database or another existing database-like product? Or is it built from scratch, just sitting on top of a file system?
Nope, you don't really need a database. What you need for fast, scalable web crawling is more like key-value storage: a really fast layer (something like RocksDB on SSD) for metadata about URLs, and another layer that can be very slow for storing crawled pages (like Hadoop or Cassandra). In reality, writing directly to Hadoop/Cassandra was too slow (because it was in a remote data center), so it was easier to just write to RAID arrays over Thunderbolt and sync the data periodically as a separate step.
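Very roughly, that split could be modeled like this (type and method names are made up for illustration, not Applebot's actual schema):

    package crawlstore

    import "time"

    // URLMeta is the small, hot record kept in the fast key-value layer,
    // keyed by URL: enough to schedule recrawls and do conditional GETs.
    type URLMeta struct {
        LastCrawled  time.Time
        ETag         string
        LastModified string
        Failures     int
    }

    // MetaStore is the fast local layer (e.g. an SSD-backed key-value store).
    type MetaStore interface {
        Get(url string) (URLMeta, bool, error)
        Put(url string, m URLMeta) error
    }

    // PageStore is the slow bulk layer for raw page bodies; in practice this
    // can be local disk that gets synced out to the big remote store later.
    type PageStore interface {
        Append(url string, body []byte) error
    }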
Interesting stuff. I've used libcurl to crawl at that kind of pace; is the parsing/indexing separate from that count per day? Also interested in how you dealt with DNS and/or rate limiting.
I've done similar at a smaller scale. Instead of messing with underlying DNS or other caching in our code, we just dropped a tuned dnsmasq in front as the resolver. The crawler had a separate worker to pre-fill host lookups, so the cache was mostly hot by the time the crawler was asking.
In my case I was just fetching the home page of all known domain names. The first issue I noticed was ensuring DNS requests were asynchronous. I wanted to rate limit fetching per /28 IPv4 to respect the hosts being crawled, but couldn't really do that without knowing the IP beforehand (while keeping the crawler busy), so I ended up creating a queue based on IP. I used libunbound. I found that some subnets have hundreds of thousands of sites, and although the crawl starts quickly, you end up rate limited on those.
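A small sketch of that bucketing step in Go (IPv6 is punted on here, and the function name is made up):

    package main

    import (
        "fmt"
        "net"
    )

    // subnetKey maps a resolved address to its /28 so fetches can be
    // rate-limited per subnet instead of per host.
    func subnetKey(ip net.IP) string {
        v4 := ip.To4()
        if v4 == nil {
            return ip.String() // fall back to per-IP for IPv6 in this sketch
        }
        return v4.Mask(net.CIDRMask(28, 32)).String() + "/28"
    }

    func main() {
        fmt.Println(subnetKey(net.ParseIP("93.184.216.34"))) // 93.184.216.32/28
    }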
Also interested, at the higher end of the scale, in how 'hard'/polite you should be with authoritative nameservers, as some of them can rate limit too.
Say the average web page is 100KB, and assume a gigabit connection in the office; that's about a thousand pages per second. If the office switch is on 10Gbit and each of the four machines has its own gigabit link, that works out to about 4000 pages/s, naively counting. But we're in the same order of magnitude for the speed even on gigabit, and we're not accounting for gzip, and the actual average page size might be a bit lower too.
Everything was on 10gigE. The average page size was around 17KB gzipped. Everything's a careful balance between CPU, memory, storage, and message throughput between machines.
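As a rough cross-check using those figures (my arithmetic, not from the original posts):

    1e9 pages/day ÷ 86,400 s/day ≈ 11,600 pages/s
    11,600 pages/s × 17 KB/page  ≈ 197 MB/s ≈ 1.6 Gbit/s aggregate
    1.6 Gbit/s ÷ 4 machines      ≈ 0.4 Gbit/s per machine

So at 17KB per page, four machines on 10GigE have plenty of network headroom; the balance the parent describes is between the other resources.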
Apple's corporate network also had incredible bandwidth to the Internet at large. Not sure why, but I assumed it was because their earliest data centers actually ran in office buildings in the vicinity of 1 Infinite Loop.
No idea, it's been years since I last worked on it. It was also not the only Go service written at Apple (90% of cloud services at Apple were written in Java), though it may have been the first one used in production.
And I sit here kind of shocked that Apple would use Java for anything, backend or not. I thought Apple had a strong preference for owning its own tech stacks, whether that be ObjC/WebObjects or later Swift...
I think WebObjects was supporting Java even before it came to Apple from NeXT. In the early days, many of Apple's services built with WebObjects even ran on Sun server hardware and Xserves. But nowadays it's all commodity Linux hardware, like you would find in any data center.
Not sure why you'd be shocked; it's a solid language for the kind of enterprise services Apple offers, and their other languages - C/C++, Objective-C, Swift - aren't very well suited to web services.
Great use case for Go though; its concurrency features are a natural fit for web crawlers. I reckon Scala could work too, although it's a lot more complicated / clever.
I would guess because the input-sanitizing requirements are harder for the web; a stack overflow when running locally requires the attacker to execute code locally -- a use-after-free exposed on port 80 has a much wider audience.
Some Apple services were written in C/C++. One downside is it's very hard to source engineers across the company who can then work on that code, or for those engineers to go work on other teams.
Apple employs the founder of the Netty project, who has given plenty of open talks about Apple’s use of Netty (which implies Java services). Same is true for Cassandra.
Apple had a very odd obsession with Java right after the NeXT purchase. WebObjects got converted, and they tried to do a Java Cocoa bridge. Both were worse than the originals.
You should check out Manning's "Introduction to Information Retrieval"; it has far more detail about web crawler architecture than I can fit in a post, and it served as a blueprint for many of Applebot's early design decisions.
With 1B pages per day I guess you needed 1 Gbit/s connections on each of those machines? Especially if they also wrote back to centralized storage.
I guess there are not many places where you can easily get 4GB/s sustained throughput from a single office (especially with proxy servers and firewalls in front of it). Is that standard at Apple or did the infrastructure team get involved to provide that kind of bandwidth?