Applebot, the web crawler for Apple (support.apple.com)
91 points by killwhitey on May 6, 2015 | 90 comments

“It looks like someone at Apple is running a web crawler written in Go.”

Likely explains [1] from last year (see [2] for HN thread).

[1] http://jan.moesen.nu/2014/11/06/apple-crawler.txt

[2] https://news.ycombinator.com/item?id=8567205

nice catch!

I remember some thread from Safari's early days where they hid the user-agent to avoid attracting attention.

>If robots instructions don't mention Applebot but do mention Googlebot, the Apple robot will follow Googlebot instructions.

So if I set my robots.txt to disallow all bots except Googlebot, Applebot will index anyway? I don't think I like that precedent.

I've said it before, and I'll say it again: blocking all but certain bots is the best way to block innovation. It's bad for you, it's bad for the ecosystem. Bad for the ecosystem because incumbents that want to respect robots.txt to propose new services will have an unfair disadvantage. It's bad for you because it'll just give more power to Google et al regarding your incoming traffic (and you'll have to follow their rules, like every SEO is doing right now).

But that is a choice that a website operator has the authority to make given current standards. Apple is ignoring the wishes of a website operator and piggybacking on the trust that a website operator has placed in Googlebot.

Note that serving bots, especially for media-rich sites, eats heavily into a finite resource.

For those of us whose resources are especially tight, blocking Yandex, Baidu and MSN may be very helpful, your ideals notwithstanding.

Putting anything on the internet exposes your "finite resources" to attackers. I'm sorry, but robots.txt is just a courtesy offered to you, and if you're doing serious business, you shouldn't rely on it. Use authentication, rate-limiting, captchas, DDoS-protecting CDNs, and enforce the limitations at the source; don't rely on nice people respecting your robots.txt.
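For what it's worth, the "rate-limiting at the source" part can be quite small. Here is a minimal sketch of a per-client token bucket, assuming you key clients by IP address or user-agent string. The class name, rate and capacity are made up for illustration and are not taken from any particular server or CDN:

```python
import time


class TokenBucket:
    """Per-client token bucket: each request costs one token."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate          # tokens refilled per second (assumed value)
        self.capacity = capacity  # maximum burst size (assumed value)
        self.now = now            # injectable clock, handy for testing
        self.buckets = {}         # client key -> (tokens, last_timestamp)

    def allow(self, client):
        """Return True if this client may make a request right now."""
        t = self.now()
        tokens, last = self.buckets.get(client, (self.capacity, t))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (t - last) * self.rate)
        if tokens >= 1:
            self.buckets[client] = (tokens - 1, t)
            return True
        self.buckets[client] = (tokens, t)
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate=1, capacity=2)
    # A burst of three requests from the same (hypothetical) crawler.
    print([limiter.allow("some-crawler") for _ in range(3)])
```

Unlike robots.txt, this does not depend on the crawler's cooperation: a client that ignores the rules simply starts getting refused once its bucket is empty.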

I wonder what the legalities of this type of discrimination are? Retail businesses aren't allowed to arbitrarily refuse service; they have to follow certain rules and be consistent.

Is there legal backing to enforce obeying robots.txt, or is it a guideline?

There are some court cases that involve robots.txt but I'm not sure because there was a lot of reading to do and I was hoping someone who knew would just come along and summarize it for us :)

I actually loved this part of the page being discussed.

The whole idea of versioning content based on who accesses it is broken and fundamentally at odds with the idea of the open web. Same goes for user-agent string madness, by the way. Yes, we should be able to tell robots from humans, but otherwise, it's supposed to be the Web.

Incidentally, this hits close to a pain point: I find it extremely annoying when publishers (like Elsevier) hide content behind a paywall, but still expose it to Googlebot for indexing. The result is that you are able to find a scientific article, which is not accessible (but Googlebot has cached snippets). This goes against Google's own guidelines (they used to tell people that Googlebot must not see different content from browsers). And it goes against the whole idea of the Web: if you want to hide stuff behind a paywall, do so — but then it is no longer accessible.

Going back to Applebot, I love the fact that they will now follow Googlebot instructions. Hopefully people will stop distinguishing who accesses content.

I remember a big issue with user-agent filtering when Google was still trying to make Google TV work. At the time, loads of TV networks had free, full streaming episodes of shows on their websites as a way to capture some ad revenue that would be lost if people were torrenting or whatever in order to see shows they missed. Google TV was attempting to list those episodes alongside whatever was currently live on TV via your cable/sat/antenna and stuff from other sites like Youtube.

The idea was that instead of having to go to all sorts of places to find content, it would show you what was available at a given time based on what you were looking for.

...and then all of the network sites and the free Hulu stuff got put behind a user agent filter and essentially wiped out a huge part of GTV's reason for existence. The goal was to bring all of the free content into one place but the networks didn't want you watching a free stream in lieu of a cable broadcast. They wanted you to watch cable on your living room TV and only use the free streaming episodes from the computer in your office as a backup.

Same goes for services where web viewing/listening is free but if you try to access it from a mobile web browser, you have to either fool the site or subscribe to some mobile version.

That's maybe jumping to conclusions.

I interpreted the sentence less literally and more like "in the absence of a rule, default to the Googlebot ones".

So if you put a wildcard rule forbidding access and a specific one allowing access to GoogleBot, AppleBot will honour the wildcard one.

That's how I would have coded it anyway: parse the rules for the current agent string (including any wildcard rule); if no rule applies, run it again with the Googlebot string before assuming that the website does not contain restrictions.

"If robots instructions don't mention Applebot"

I would assume this means that it will follow GoogleBot unless you specifically mention AppleBot by name and not by using a wildcard.

So a User-agent: * would be ignored if a User-agent: GoogleBot is found.
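The two readings in this subthread differ only in the order in which rule groups are tried. A rough sketch of both, assuming a deliberately simplified robots.txt parser (prefix-only Disallow rules, no Allow lines). None of this is Apple's actual code, and all function names are hypothetical:

```python
def parse_groups(text):
    """Map each lowercase user-agent token to its list of Disallow prefixes."""
    groups = {}
    current_agents = []
    seen_rule = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a rule line ended the previous group
                current_agents = []
                seen_rule = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            seen_rule = True
            if value:  # an empty Disallow means "nothing is disallowed"
                for agent in current_agents:
                    groups[agent].append(value)
    return groups


def allowed(groups, order, path):
    """Apply the first group in `order` that exists; allow if none does."""
    for agent in order:
        if agent in groups:
            return not any(path.startswith(p) for p in groups[agent])
    return True


# Reading 1: a wildcard group counts as "mentioning" Applebot.
WILDCARD_FIRST = ["applebot", "*", "googlebot"]
# Reading 2: only a group naming Applebot counts; Googlebot beats the wildcard.
GOOGLEBOT_FIRST = ["applebot", "googlebot", "*"]

ROBOTS = """
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

if __name__ == "__main__":
    groups = parse_groups(ROBOTS)
    print("wildcard first: ", allowed(groups, WILDCARD_FIRST, "/page"))
    print("googlebot first:", allowed(groups, GOOGLEBOT_FIRST, "/page"))
```

With the "block everyone except Googlebot" file above, the first ordering keeps Applebot out and the second lets it in, which is exactly the disagreement here. Which one Apple implements is not something the quoted sentence settles.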

Serious question, because I can't imagine your use case: under what circumstances would you want to block all bots except one?

Facebook blocks all bots except a select few to prevent site scraping - https://www.facebook.com/robots.txt

I'm not an expert but my guess is: limiting bot traffic, but keeping the site available for the most popular search engine.

My experience is that the worst bots don't respect robots.txt anyway.

Getting crawled by the major search engines typically isn't that bad, they tend to know what they're doing. Getting hammered by some crappy local search engine is what's annoying.

We don't limit any bots, except once when we completely blocked Eniro in our firewall. Google, Bing and a ton of others could index at the same time, with no issue. Eniro for some reason decided to just index way too much at once, with no reaction to robots.txt and no reply from the email address they so kindly included in the headers.

But I see your point, it's just a bit sad when Google has become "The Internet".

I thought FB was the internet. Googlebot is just the Kleenex of indexers.

Owner of GOOG stock, maybe?

Maybe for ethical reasons or some kind of exclusive agreement with one particular search vendor?

Is it of any importance?

That's what you get when many people code with only one company in mind. It happened in the past and it is happening again, whether we like it or not. Even Microsoft was kinda masquerading early versions of Edge as Chrome.

"Even Microsoft was kinda masquerading early versions of Edge as Chrome"

I would assume that was more to keep the tech press from seeing it.

I don't know about the non-public betas, but as soon as there was a public Windows 10 TP and you could swap IE's engine to Edge, I'm pretty sure they had "Edge" at the end of the user-agent, but they had the rest of the Chrome UA, so most of the websites out there considered it to be Chrome.
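To make the mechanism concrete, here is a small sketch of why substring-based sniffing would misclassify such a browser. The user-agent string below is only illustrative, shaped the way the comment describes (a Chrome-style UA with an Edge token appended); the version numbers are placeholders and both function names are hypothetical:

```python
# Illustrative UA string: Chrome-style tokens followed by an Edge token.
# The version numbers are placeholders, not taken from a real build.
EDGE_LIKE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240"
)


def naive_browser(ua):
    """What many sites effectively do: the first familiar token wins."""
    if "Chrome" in ua:
        return "Chrome"
    if "Safari" in ua:
        return "Safari"
    return "other"


def careful_browser(ua):
    """Check the most specific (trailing) token before the generic ones."""
    if "Edge/" in ua:
        return "Edge"
    return naive_browser(ua)


if __name__ == "__main__":
    print(naive_browser(EDGE_LIKE_UA))    # sniffing code sees Chrome
    print(careful_browser(EDGE_LIKE_UA))  # only newer code sees Edge
```

Sites that had only ever checked for "Chrome" would keep serving their Chrome code path, which is presumably the point of shaping the string that way.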

I may also be completely wrong :)

Why? Applebot is attempting to adhere to the spirit of robots.txt rather than the letter. If the site owner cares otherwise, don't just explicitly name one bot. Is `user-agent: *bot` valid?
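On the `*bot` question: as far as I know, the robots.txt convention treats `*` as special only when it stands alone as the whole user-agent value, not as a glob inside a name. One way to see how a real parser handles it is Python's standard `urllib.robotparser`. This is just a sketch of one parser's behaviour (other crawlers may differ), and the helper name and example URL are made up:

```python
from urllib.robotparser import RobotFileParser


def can_applebot_fetch(robots_txt, url="https://example.com/private"):
    """Ask Python's stdlib parser whether 'Applebot' may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Applebot", url)


GLOB_STYLE = "User-agent: *bot\nDisallow: /\n"  # the pattern asked about
CATCH_ALL = "User-agent: *\nDisallow: /\n"      # the standard wildcard

if __name__ == "__main__":
    print("*bot group blocks Applebot:", not can_applebot_fetch(GLOB_STYLE))
    print("*    group blocks Applebot:", not can_applebot_fetch(CATCH_ALL))
```

With this parser the `*bot` group is read as a literal agent name and does not apply to Applebot, while the plain `*` group does. So to cover bots in general, a bare `User-agent: *` is the safer choice.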

You'd prefer they didn't honor robots.txt at all?

Apple is making a smart move by not wanting to depend on anyone else.

Results may suck in the beginning, but well, competing in hard stuff is hard; this is another Apple Maps. Hopefully they won't get bashed so hard, since this is not so user-facing.

> Apple is making a smart move by not wanting to depend on anyone else.

That's one way of looking at it. A different one would be that the wealthiest company in the world could work with practically anyone and get a better product than they could build themselves more quickly with more features. The idea that they have to do everything in-house to get the best is paranoid and stupid.

For example, rather than build AppleBot, why couldn't they pump a few billion into DuckDuckGo to get use of DuckDuckBot? Or fund archive.org to access their index? Or buy in to commoncrawl.org?

It would be possible for Apple to use its fortune to benefit both their customers and the world. Google and Bing are not the only options.

They could, and they have done it in the past, even with google. But if they can afford it, why depend on anyone else?

I'm pretty sure they're not interested in quickly, or more features. Their users can still use other products (and let's be honest, most products WANT to be on the Apple platforms), but at least this way they make sure their users aren't left stranded if those products cease to exist and/or are not updated. Remember the maps situation?

They are making sure their users have core features without having to depend on others' good will.

They could buy another company, sure, but I wouldn't count on DDG being ready to sell out, and besides, getting new people to work on their new thing is probably easier/better organisation-wise than to on-board a different company/organisation with a lot of baggage.

DDG would be a perfect fit for Apple, IMHO, as long as they don't kill the company and leave them some autonomy.

Is DuckDuckBot for real? While I'm sure they are diversifying, DDG's results are still based on Bing.

As far as I know DuckDuckGo has mostly been using Yandex's results lately. DuckDuckBot is a real thing (and has been for a long time), but it doesn't do much except power their anti-spam signals.

I am hoping this will turn the tide on the trend of "considering Google alone as the master in search". I'm pretty positive Apple's engineers will come up with good algorithms to keep them on par.

This is interesting.

A while back, I think either Cook or Jobs mentioned that Apple makes PRODUCTS and doesn't sell ADS.

If that's true (and stays true) AND this is the beginning of a search engine for them, it's going to be VERY interesting to see what it looks like.

FWIW, an Apple recruiter approached me about working on a new Search-related thing. He used the following enticement in the initial email: "We are building the future of search for the best user experience (unadulterated by advertising for the first time in history)." So as best I can tell, a search experience without advertising is very much on their mind.

Hey, I don't think it's appropriate to post publicly information that somebody divulged to you in confidence.

I don't think recruitment spam really counts as "in confidence." Certainly he's under no NDA. I guess some might consider it rude, but I wouldn't.

He said it was the recruiter's initial e-mail. If someone e-mails you out of the blue with details like that I think you're more than allowed to tell others about it.

I don't think email is considered confidential by default. Had this recruiter made any confidentiality request, I would have tried my best to honor it. Instead, he seemed more interested in spreading the word that they were entering the search game. Also, Apple has not been exactly hiding their growing interest in search. They rarely let their engineers speak in public, but they were on stage this year at Lucene Revolution giving a number of details about how they are using Solr.

It's not clear-cut, that's for sure. I just wanted to post a different perspective.

Thing is, if you asked the sender for permission to post pieces of their email, they'd probably say no. It seems a bit gauche to say posting is okay because "nobody told me not to."

It's a good point and since journalists are already asking about it, I now wish I hadn't even posted it. I think Apple is not trying to hide the fact that they are looking for search and machine learning people, but the press will surely get it wrong trying to triangulate a vague one-liner from a recruiting email.

Apple already sells ads, and in fact restricts some technology to advertising partners. WebGL, full screen ads etc. They also have a patent on unskippable ads.

I don't know why people don't treat statements like that with enough cynicism.

They made iAd so they would control the major advertising network on iOS and as a result could limit the privacy implications. iAd is by far the least intrusive major advertising network.

WebGL is no longer a restricted technology in iOS 8+. Btw, it was restricted only in Safari.

Sure, but I find it quite unfair to say Apple doesn't sell ads when they not only own their own exclusive ad network, but also control their entire platform, own many ad-related patents and actively restricted access to new technologies to their advertisers so that they could outcompete anything else on iOS.

They don't outcompete anyone on iOS. Advertisers hate iAds because they don't have as much access to user data as they do with Adwords etc.

I think this is more about AI and offering you answers to questions like Siri does... It will not list any links. Launching a new search engine has no benefits for Apple from my point of view.

This is more into AI than Ads. AFAIK Siri uses Wolfram Alpha for its answers. Given the way Google's knowledge graph is advancing, it totally makes sense for them to build their own AI engine.

Google Knowledge Graph is based on Freebase.com. Google shut it down last month. IBM relied on Freebase for Watson, so they acquired Blekko last month, a real web search engine & knowledge base startup. Microsoft already owns Powerset, the company that powers their Cortana. Apple needs an up-to-date knowledge base too; currently they rely on WolframAlpha and Freebase, afaik.

Shutting down Freebase was a big hit for many projects in the AI space. Freebase had 2,903,361,537 facts; in comparison, Wikimedia's Wikidata has just 13,924,224 facts. That's still a huge difference.

http://en.wikipedia.org/wiki/Freebase , http://en.wikipedia.org/wiki/Knowledge_Graph , http://en.wikipedia.org/wiki/Blekko , http://en.wikipedia.org/wiki/Powerset_(company) , http://en.wikipedia.org/wiki/Knowledge_Vault , http://venturebeat.com/2015/03/27/ibm-acquires-web-crawling-...

I'm assuming you are remembering Cook's open letter on privacy http://www.apple.com/privacy/

"Our business model is very straightforward: We sell great products. We don’t build a profile based on your email content or web browsing habits to sell to advertisers."

This obviously is not saying Apple doesn't sell ads. It's saying their core business model is selling products. The users are not the product.

Note that it doesn't say they don't build any profiles at all, or use your behavior for their own purposes, or that they don't collect data. It is, in fact, very likely they do use profiles of their users for internal purposes, even if just for improving those services.

It does not say they don't collect data, though. And I'm pretty sure not even Google sells data to advertisers; they sell ad space. It's just a PR stunt.

But Apple does sell ads...

And while I would love a privacy focused search engine, I just don't think you can build a good one without private data.

iAd Advertising with Apple http://advertising.apple.com/

As others have commented, I have fears about iAd. I hope Apple will refrain from that business completely.

Please let this be the beginning of an Apple search engine. We really need some better alternatives to Google.

I almost exclusively use Duck Duck Go (https://duckduckgo.com/) as an alternative to Google. The only time it doesn't give me the answers I need is occasionally for code / bug related searches, in which case I go back to Google.

I have also noticed that the more detailed a query is for technical issues, bugs or combinations of software, the more I have to rely on Google as well. My workflow is similar to yours. I silently hope this is just a matter of DDG improving over time.

But truth be told, DDG is so handy with the shebang keywords that I search Google using it, leave only DDG in my Firefox search bar, and specialize with !g !gv !gi as needed.

Actually, yeah, whenever I need to search for something code related I don't go to Google, I just redo the DDG search with !g at the end. The shebangs are incredibly useful for that.

I second this. I've been using DuckDuckGo for about two years so far, both on desktop and mobile. It really is a Google replacement, and we can search other sites directly with the shortcuts:

github? !gh

wikipedia? !w

youtube? !yt

google? !g

images? !i

and there are tons of them!!

I'd rather have a decentralized alternative à la Bitcoin than another mega-corp search engine.

EDIT: on that point, it would be very neat but probably very difficult to build a search index based on an agreed-upon algorithm like Bitcoin. There would be a need for some sort of voting system to update the algorithm and to moderate people gaming it.

Yacy[1] is a decentralized search engine

[1] http://www.yacy.net/en/

For the privacy-aware, Apple used to be a good alternative for Google, but I'm afraid that soon this will no longer be the case.

What are you basing this statement on?

Google's business model is based on mining and "selling" your data (you are not the customer).

Apple's business model is based on selling real products.

Apple still collects data on you. Google knows what sites you browse on the Web and what Web sites you search for. Apple knows what apps you search for in the App Store and what you buy in the store, and probably usage statistics. At some point, if not already, that data will be used to profile you and generate revenue maximizing suggestions of goods and services to buy in the App Store.

You're completely missing the point.

Google makes 90% of their revenue from advertising.

Apple makes <5% of their revenue from App Store sales.

So you're ok with your data being collected, as long as they don't make money off it?

Sorry, but I hope you agree that one's web browsing behavior is a gazillion times more interesting than one's app-usage behavior.

My search queries start with Siri and end in Safari

Apple knows my web browsing behavior because Safari is my default mobile browser app syncing my history, reading list and bookmarks with desktop Safari.

Apple knows where I work, live and my travels via Apple Maps and Passbook

Apple knows a few of my purchases via ApplePay both in the real world transactions and via ApplePay in apps.

I used MobileMe mail before switching to GMail and have considered switching to iCloud mail.

Apple knows a lot more interesting things about me than Google.

Search is a dirty business - in the sense that you have to index everything, even the content you don't like, in order to be truly useful (with the common-sense exceptions for scummy, abhorrent and/or illegal content).

I could be wrong, but I think Apple might have an ideological problem with a lot of content. It's their call if they decide to filter that stuff out, but it's censorship, and I struggle to see how that would result in a "better" search engine.

EDIT: Applebot still has uses beyond a search engine though. I think Apple is being straightforward in its explanation. It's for Siri and Spotlight.

Well, with the control freaks that Apple are, their "we know better than you" approach, and their love of sitting in the curator/gatekeeper spot, it seems like the only search engine Apple can produce is Yahoo circa 1998. Also: no porn or torrents on it.

In what way do you imagine Apple will be better?

My question is: why do we need an alternative really? What is better when even more companies index everything and perhaps screw with your privacy? My intuition agrees with you though. More companies == more competition. At some point privacy might be a selling point for them.

On the other hand: it is kind of crazy that when I read about "indexing the web" it starts ringing all kind of privacy bells, while apple's incentive might not even be to violate people's privacy.

When one company has more than 90% market share then: yes, we do need an alternative. (Google's market share in search has been reported to be as high as 94% in countries like Denmark.)

If a company manages to upset Google then that company ceases to exist online, regardless of the validity of Google's reason for blacklisting that company. Ideally no search engine would be above 20% market share (20% being a random low number I just made up).

Right now websites and marketing material/money are directed at Google exclusively, making it continually harder for new search companies to succeed.

I agree with you. Google has too much power over small webshops. If you somehow end up in bad standing with Google you can just shut down your operation, because you won't make much money.

It's nice to have alternatives for something that's so fundamental to how we use the internet.

Google works well today, but what about 5 years from now, or 10 years from now? What if you need to be signed into Google+ to Google search? What if your Google+ account needs to be tied to a mobile phone? What if Google has limited search, and you need to own a Chromebook or Android device for the full search experience and viewing more than 10 results?

Competition keeps them in check from doing anything outrageous, and then if they decide to anyway, we have other choices we can migrate towards.

Why would they abandon free advertising income? There are still 3 billion+ people yet to learn how to use the internet. That's a lot of money, son.

The main problem with Google is not that it indexes everything, but that it indexes everyone - since almost everyone (at least in the US and Europe) uses Google. A strong competitor would limit Google's reach in this regard. In this particular case we could assume Google would know far less about iPhone owners. In the longer term Google (and others) would have to adapt to that limited reach: either by forcing its way onto the iPhone somehow, or - hopefully - by not basing its offer on its vast knowledge about users.

> please contact us at: Apple-NOC “at” apple.com

Can't Apple build itself some spam protection??!? Search is harder than this.

Based on logs, it would seem that the crawler, or at least parts of it, is written in golang. When following some redirects, the user-agent would be cleared and it would identify itself as golang.

Introducing Apple Web Search. Only available on iOS and OS X devices.

>If robots instructions don't mention Applebot but do mention Googlebot, the Apple robot will follow Googlebot instructions.

This is nice to see for a change. There are a lot of search engine bots out there and forgetting a lot of them is easy to do.

Will this be more focused on improving search for digital products (App Store) or more toward creating a comprehensive catalog for digital/physical products to index + download/purchase?

Apple is trying to get to doing what Google does faster than Google gets to doing what Apple does.

In other words: Both Apple and Google in the mobile OS and device space. Both Apple and Google in the (mobile) search space.

Has anyone really been far even as decided to use even go want to do look more like?

small-letter footprint:

    Risks are inherent in the use of the Internet

I wonder if Apple Search will be as close to Google Search as Apple Maps is to Google Maps.

That is to say, not very close at all.

If you can't see the expected results, you are probably searching it the wrong way! But, it's so shiny. Yes, but all the results seem Apple specific?
