
Some fun facts:

- Applebot was originally written in Go (and uncovered a user agent bug on redirects, revealing it's Go origins to the world, which Russ Cox fixed the next day).

- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

- In it's first week of existence, it nearly took Apple's internal DNS servers offline. It was then modified to do it's own DNS resolution and caching, fond memories...

Source: I worked on the original version.




> It was then modified to do it's own DNS resolution and caching, fond memories...

Unlike other languages, Go bypasses the system's DNS cache and goes directly to the DNS server, which is a root cause of many problems.


This is true but a little misleading. On Windows Go uses GetAddrInfo and DNSQuery, which do the right thing. But on Linux there are two options: netgo and netcgo -- a pure Go implementation that doesn't know about NSS, and a C wrapper that uses NSS.

Since netgo is faster, by default Go will try its best to determine if it must use netcgo by parsing /etc/nsswitch.conf, looking at the TLD, reading env variables, etc.

If you're building the code you can force it to use netcgo by adding the netcgo build tag.

If you're an administrator, the least intrusive method I think would be setting LOCALDOMAIN to something (or to '' if you can't think of anything), which will force it to use NSS.
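
For completeness, roughly what that looks like from the Go side: the netcgo build tag at compile time, or picking a resolver explicitly in code via net.Resolver. This is just a sketch; the DNS server address is a placeholder, not anything in particular.

    // Build time: force the cgo/NSS resolver for the whole binary.
    //   go build -tags netcgo ./...
    //
    // In code: net.Resolver lets you choose per resolver instead.
    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    func main() {
        r := &net.Resolver{
            // PreferGo selects the pure Go resolver; leave it false to get
            // the platform default (cgo/NSS where available).
            PreferGo: true,
            // A custom Dial points the pure Go resolver at a specific DNS
            // server (8.8.8.8 here is only an example).
            Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
                d := net.Dialer{Timeout: 2 * time.Second}
                return d.DialContext(ctx, network, "8.8.8.8:53")
            },
        }
        addrs, err := r.LookupHost(context.Background(), "example.com")
        fmt.Println(addrs, err)
    }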


Yeah, I've never had to implement my own DNS cache for a language before...

If you're on a system with cgo available, you can use `GODEBUG=netdns=cgo` to avoid making direct DNS requests.

This is the default on macOS, so if it was running on four Mac Pros I wouldn't expect it to be the root cause.


It's possible that wasn't the default setting on Macs back then. I don't know that cgo would be a good choice either, if you're resolving a ton of domains at once. Early versions of Go would create new threads if a goroutine made a cgo call and an existing thread was not available. I remember this required us to throttle concurrent dial calls; otherwise we'd end up with thousands of threads and eventually bring the crawler to a halt.

To make DNS resolution really scale, we ended up moving all the DNS caching and resolution directly into Go. Not sure that's how you'd do it today; I'm sure Go has changed a lot. Building your own DNS resolver is actually not so hard with Go; the following were really useful:

https://idea.popcount.org/2013-11-28-how-to-resolve-a-millio...

https://github.com/miekg/dns
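
Not the actual Applebot code, but a rough sketch of the kind of thing miekg/dns makes easy: query an upstream directly and cache the answers in memory. The upstream address is a placeholder and the cache ignores TTLs, which a real crawler would obviously need to honor.

    package main

    import (
        "fmt"
        "sync"
        "time"

        "github.com/miekg/dns"
    )

    // cachingResolver is deliberately naive: one upstream, A records only,
    // and a map behind a mutex. A production crawler would shard the cache,
    // honor per-record TTLs, and retry across several upstreams.
    type cachingResolver struct {
        upstream string // e.g. "8.8.8.8:53" (placeholder)
        client   *dns.Client
        mu       sync.Mutex
        cache    map[string][]string
    }

    func (r *cachingResolver) lookup(host string) ([]string, error) {
        r.mu.Lock()
        if ips, ok := r.cache[host]; ok {
            r.mu.Unlock()
            return ips, nil
        }
        r.mu.Unlock()

        m := new(dns.Msg)
        m.SetQuestion(dns.Fqdn(host), dns.TypeA)
        resp, _, err := r.client.Exchange(m, r.upstream)
        if err != nil {
            return nil, err
        }
        var ips []string
        for _, ans := range resp.Answer {
            if a, ok := ans.(*dns.A); ok {
                ips = append(ips, a.A.String())
            }
        }
        r.mu.Lock()
        r.cache[host] = ips
        r.mu.Unlock()
        return ips, nil
    }

    func main() {
        r := &cachingResolver{
            upstream: "8.8.8.8:53",
            client:   &dns.Client{Timeout: 2 * time.Second},
            cache:    make(map[string][]string),
        }
        fmt.Println(r.lookup("example.com"))
    }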


And Java.

As I understand it, Go and Java are both trying to avoid FFI and calling out to system libs for name resolution.

I tend to always offer a local caching resolver available over a socket.


>- Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

Considering the timeline, are those Trash Can Mac Pros? Or the old Cheese Grater?


Trash cans :)


>Up until the release of iOS 9, Applebot ran entirely on four Mac Pro's in an office. Those four Mac Pro's could crawl close to 1B web pages a day.

The scale of web stuff sometimes surprises me. 1B web pages sounds like just about the daily web output of humanity? How can you handle this with 4 (fast) computers?


Computers are very fast. We just tend to not notice because today's software is obese.


Yes, let's all run separate web browsers as the application and run our own JavaScript inside our browser. Who cares if there's 5 other "apps" doing exactly the same!

Insanity.


Multiple tabs/browser windows is similar and generally not an issue.


I think they were referring more to apps like Slack and other similar JS/browser-based apps which run separately from the browser. Maybe I'm being generous? Slack is certainly itself a beast.


Yes, this is precisely what I meant: Electron apps, e.g. Slack, VS Code, Skype, etc., ad nauseam.


Doesn't it depend on a lot of things? For example, you could do just HEAD requests to see if a page has changed since a given timestamp; if not, there's no need to process it.


For anybody wondering how:

The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method. For example, if a URL might produce a large download, a HEAD request could read its Content-Length header to check the filesize without actually downloading the file.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/HE...
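
In Go that's about one line; a minimal sketch (the URL is a placeholder):

    package main

    import (
        "fmt"
        "net/http"
    )

    func main() {
        // HEAD returns only the headers, so we can check size and
        // last-modified time without downloading the body.
        resp, err := http.Head("https://example.com/big-file.zip")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println("Content-Length:", resp.ContentLength)
        fmt.Println("Last-Modified:", resp.Header.Get("Last-Modified"))
    }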


Typically you wouldn’t bother with a HEAD request, you’d do a conditional GET request.

When you request a page, the response includes metadata, usually including a Last-Modified timestamp and often including an ETag (entity tag). Then when you make subsequent requests, you can include these in If-Modified-Since and If-None-Match request headers.

If the resource hasn’t changed, then the server responds with 304 Not Modified instead of sending the resource all over again. If the resource has changed, then the server sends it straight away.

Doing it this way means that in the case where the resource has changed, you make one request instead of two, and it also avoids a race condition where the resource changes between the HEAD and the GET requests.
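
A minimal Go sketch of a conditional GET; the ETag and timestamp would come from whatever you stored on the previous crawl (placeholders here):

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        req, err := http.NewRequest("GET", "https://example.com/", nil)
        if err != nil {
            panic(err)
        }
        // Values saved from the previous crawl of this URL (placeholders).
        req.Header.Set("If-None-Match", `"abc123"`)
        req.Header.Set("If-Modified-Since",
            time.Now().Add(-24*time.Hour).UTC().Format(http.TimeFormat))

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        if resp.StatusCode == http.StatusNotModified {
            fmt.Println("304: unchanged, nothing to re-process")
            return
        }
        fmt.Println("changed, new ETag:", resp.Header.Get("ETag"))
    }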


Do a lot of random pages return etags? I've only ever seen them in the AWS docs for boto3


nginx sends it by default for static files (example: hacker news [0]), I assume other web servers do too.

[0] https://news.ycombinator.com/y18.gif


I am particularly curious about data storage.

Does it use a traditional relational database or another existing database-like product? Or is it built from scratch, just sitting on top of a file system?


Nope, you don't really need a database. What you need for fast, scalable web crawling is more like key-value storage: a really fast layer (something like RocksDB on SSD) for metadata about URLs, and another layer that can be very slow for storing crawled pages (like Hadoop or Cassandra). In reality, writing directly to Hadoop/Cassandra was too slow (because it was in a remote data center), so it was easier to just write to RAID arrays over Thunderbolt and sync the data periodically as a separate step.
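
Not how Applebot actually structured it, but the rough shape of that split looks something like this (the types and field names are made up for illustration):

    package crawler

    import "time"

    // URLMeta is the small, hot record that answers "should I crawl this
    // URL again, and when?" It lives in the fast key-value layer
    // (something like RocksDB on SSD).
    type URLMeta struct {
        URL         string
        LastCrawled time.Time
        ETag        string
        LastStatus  int
        ContentHash string // detect unchanged pages without re-parsing
    }

    // MetaStore is the fast layer: keyed by URL, read and written constantly.
    type MetaStore interface {
        Get(url string) (URLMeta, bool, error)
        Put(meta URLMeta) error
    }

    // PageStore is the slow bulk layer: raw crawled bodies, appended once
    // and read later by the parsing/indexing pipeline (Hadoop/Cassandra,
    // or local RAID synced out periodically, as described above).
    type PageStore interface {
        Append(url string, body []byte) error
    }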


Interesting stuff. I've used libcurl to crawl at that kind of pace; is the parsing/indexing separate from that count per day? Also interested in how you dealt with DNS and/or rate limiting.


I've done something similar at a smaller scale. Instead of messing with underlying DNS or other caching in our code, we just dropped a tuned dnsmasq in front as the resolver. The crawler had a separate worker to pre-resolve hosts, so the cache was mostly hot when the crawler was asking.
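
For anyone curious, a "tuned dnsmasq" in this context is only a handful of config lines; the values below are illustrative, not what we actually ran:

    # /etc/dnsmasq.conf
    listen-address=127.0.0.1
    no-resolv                # don't read /etc/resolv.conf
    server=8.8.8.8           # upstream resolvers (placeholders)
    server=1.1.1.1
    cache-size=10000         # the default is only 150 entries
    dns-forward-max=5000     # allow many concurrent upstream queries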


In my case I was just fetching the home page of all known domain names. The first issue I noticed was ensuring DNS requests were asynchronous. I wanted to rate limit fetching per /28 IPv4 to respect the hosts getting crawled, but couldn't really do that without knowing the IP beforehand (while keeping the crawler busy), so I ended up creating a queue based on IP. I used libunbound. Found that some subnets have hundreds of thousands of sites, and although the crawl starts quickly, you end up rate limited on those.

Also interested, at the higher end of the scale, in how 'hard'/polite you should be with authoritative nameservers, as some of them can rate limit too.
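
The bucketing part is simple enough; a minimal sketch (the /28 mask and names are mine, not from the crawler above):

    package main

    import (
        "fmt"
        "net"
    )

    // bucketFor returns the /28 network an IPv4 address belongs to, so all
    // hosts in the same small subnet share one politeness/rate-limit queue.
    func bucketFor(ip net.IP) string {
        v4 := ip.To4()
        if v4 == nil {
            return ip.String() // fall back to per-IP buckets for IPv6
        }
        return v4.Mask(net.CIDRMask(28, 32)).String() + "/28"
    }

    func main() {
        // Both addresses land in the same /28, so they share a queue.
        fmt.Println(bucketFor(net.ParseIP("203.0.113.5")))
        fmt.Println(bucketFor(net.ParseIP("203.0.113.14")))
    }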


Roughly estimating, each Mac Pro could crawl around 3k pages per second.


Which is not possible


Say the average web page is 100KB, and assuming a gigabit connection in the office, that's about a thousand pages per second. If the office switch is on 10Gbit that would work out to 4000 p/s, naively counting. But we're in the same order of magnitude for the speed even on gigabit, and we're not accounting for gzip, and the actual average page size might be a bit lower too.


Everything was on 10gigE. The average page size was around 17KB gzipped. Everything's a careful balance between CPU, memory, storage, and message throughput between machines.

Apple's corporate network also had incredible bandwidth to the Internet at large. Not sure why, but I assumed it was because their earliest data centers actually ran in office buildings in the vicinity of 1 Infinite Loop.
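
A quick back-of-the-envelope with those figures (1B pages/day across four machines at ~17KB gzipped per page), just to show the numbers hang together:

    package main

    import "fmt"

    func main() {
        const (
            pagesPerDay = 1e9
            machines    = 4.0
            pageKB      = 17.0 // gzipped, per the comment above
            secondsDay  = 24 * 60 * 60.0
        )
        perMachinePerSec := pagesPerDay / machines / secondsDay
        gbitPerMachine := perMachinePerSec * pageKB * 1024 * 8 / 1e9

        fmt.Printf("~%.0f pages/s per machine\n", perMachinePerSec) // ~2894
        fmt.Printf("~%.1f Gbit/s per machine\n", gbitPerMachine)    // ~0.4
    }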


The average is a lot closer to 25KB IIRC, gzipped


Why not?


Can you share some more details about the current state? Is it still written in Go?


No idea, it's been years since I last worked on it. It was also not the only Go service written at Apple (90% of cloud services at Apple were written in Java), though it may have been the first one used in production.


And I sit here kind of shocked that Apple would use Java for anything, backend or not. I thought Apple had a strong preference for owning its own tech stacks, whether that be ObjC/WebObjects or later Swift...


I think WebObjects was supporting Java even before it came to Apple from NeXT. In the early days, many of Apple's services built with WebObjects even ran on Sun server hardware and Xserves. But nowadays it's all commodity Linux hardware, like you would find in any data center.


WebObjects has been fully Java since version 5 was released in 2001: https://en.wikipedia.org/wiki/WebObjects#WOWODC

Apple's server stack has been primarily Java for about 20 years.


Not sure why you'd be shocked; it's a solid language for the kind of enterprise services Apple offers, and their other languages - C/C++, Objective-C, Swift - aren't a great fit for web services.

Great use case for Go though, especially its concurrency features for web crawlers. I reckon Scala could work too, although it's a lot more complicated / clever.
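
For what it's worth, the concurrency part really is pleasant; a minimal sketch of bounded-concurrency fetching with a buffered-channel semaphore (the limit and URLs are placeholders), which is also the usual fix for the thread-explosion issue mentioned upthread:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "sync"
    )

    func main() {
        urls := []string{"https://example.com/", "https://example.org/"}

        sem := make(chan struct{}, 64) // cap concurrent fetches (and dials)
        var wg sync.WaitGroup

        for _, u := range urls {
            wg.Add(1)
            go func(u string) {
                defer wg.Done()
                sem <- struct{}{}        // acquire a slot
                defer func() { <-sem }() // release it

                resp, err := http.Get(u)
                if err != nil {
                    fmt.Println(u, "error:", err)
                    return
                }
                defer resp.Body.Close()
                n, _ := io.Copy(io.Discard, resp.Body) // a real crawler would parse/store
                fmt.Println(u, resp.StatusCode, n, "bytes")
            }(u)
        }
        wg.Wait()
    }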


Out of curiosity, why would C or C++ not be good for web services?


I would guess because the input-sanitizing requirements are harder for the web; a stack overflow when running locally requires the attacker to execute code locally, while a use-after-free reachable from port 80 exposes you to a much wider audience.


Some Apple services were written in C/C++. One downside is it's very hard to source engineers across the company who can then work on that code, or for those engineers to go work on other teams.


Apple employs the founder of the Netty project, who has given plenty of open talks about Apple’s use of Netty (which implies Java services). Same is true for Cassandra.


Apple had a very odd obsession with Java right after the NeXT purchase. WebObjects got converted and they tried to do a Java Cocoa. Both were worse than the original.


At the CocoaHeads user group I heard that Ruby has become very popular for their services more recently.


Can you talk more about the specifics? What kind of parsers did you guys use? How about storage? How often did you update pages?


You should check out Manning's "Introduction to Information Retrieval"; it has far more detail about web crawler architecture than I can fit in a post, and it served as a blueprint for many of Applebot's early design decisions.


Nice, thanks for the recommendation!

The book is freely available online at https://nlp.stanford.edu/IR-book/information-retrieval-book....


With 1B pages per day I guess you needed 1Gbit/s connections on each of those machines? Especially if they also wrote back to centralized storage.

I guess there are not many places where you can easily get 4Gbit/s sustained throughput from a single office (especially with proxy servers and firewalls in front of it). Is that standard at Apple, or did the infrastructure team get involved to provide that kind of bandwidth?


Do you have a timeline of how AppleBot has evolved?


Was that including the ability to render JS-driven, asynchronously loaded pages, including subsequent XHR requests? If so, it's beyond impressive.


Thanks for sharing, mate. Those are amazing insights!


Why did you leave Apple?


Sorry to be pedantic, but your misuse of apostrophes in an otherwise perfect text annoys me.

All three uses of "it's" should be "its".

And I would just write "Mac Pros" instead of Mac Pro's".



