
Google’s robots.txt parser is now open source - dankohn1
https://opensource.googleblog.com/2019/07/googles-robotstxt-parser-is-now-open.html
======
randomstring
Where was this 10 years ago when I was reverse engineering the Google
robots.txt parser by feeding example robots.txt files and URLs into the Google
webmaster tool? I actually went so far as to build a convoluted honeypot
website and robots.txt to see what the Google crawler would do in the wild.

Having written the robots.txt parser at Blekko, I can tell you that what
standards there are, are incomplete and inconsistent.

Robots.txt files are usually written by hand in random text editors ("\n"
vs. "\r\n" vs. a mix of both!) by people who have no idea what a
programming-language grammar is, let alone how to follow the BNF in the RFC.
There are situations where adding a newline completely negates all your rules:
specifically, newlines between user-agent lines, or between user-agent lines
and their rules.
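
For example (file contents made up for illustration), under the original 1994
convention that records are separated by blank lines, these two files are not
equivalent:

    User-agent: FooBot
    Disallow: /private/

versus:

    User-agent: FooBot

    Disallow: /private/

A strict record-based parser can read the second as a FooBot record with no
rules at all, followed by an orphaned Disallow line, so the rule is silently
dropped.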

My first inclination was to build an RFC compliant parser and point to the
standard if anyone complained. However, if you start looking at a cross
section of robots.txt files, you see that very few are well formed.

With the addition of sitemaps, crawl-delay, and other non-standard syntax
adopted by Google, Bing, and Yahoo (RIP), the RFC is clearly just a starting
point: what ends up on websites can be broken, and it can be hard to interpret
the author's meaning. For example, the Google parser allows for five possible
spellings of DISALLOW, including DISALLAW.

If you read a few webmaster boards, you see that many website owners don't
want a lesson in Backus–Naur form and are quick to get the torches and
pitchforks if they feel some crawler is wasting their precious CPU cycles or
cluttering up their log files. Having a robots.txt parser that "does what the
webmaster intends" is critical. Sometimes, I couldn't figure out what some
particular webmaster intended, let alone write a program that could. The only
solution was to draft off of Google's de facto standard.

(To the webmaster with the broken robots.txt and links on every product page
with a CGI arg with "&action=DELETE" in it, we're so sorry! but... why???)

Here's the Perl for the Blekko robots.txt parser.
[https://github.com/randomstring/ParseRobotsTXT](https://github.com/randomstring/ParseRobotsTXT)

~~~
asdfman123
It's an easy fix if Google cared. Have an online tool that validates whether a
robots.txt file is correct, and send out an announcement that files that don't
meet the spec will be penalized in terms of SEO.

~~~
hombre_fatal
That just punishes users by sinking relevant results for reasons users
couldn’t possibly care about.

~~~
PunksATawnyFill
I enjoy the hypocrisy of Google punishing sites that aren't "mobile-friendly,"
and then deliberately disabling ZOOMING on their own mobile sites.

------
jxcl
I've been in disagreements with SEO people quite frequently about a "Noindex"
directive for robots.txt. There seem to be a bunch of articles that are sent
to me every time I question its existence [0][1]. Google's own documentation
says that noindex belongs in an HTML meta tag, but the SEO people seem to
trust these shady sites more.

I haven't read through all of the code, but assuming this is actually what's
running on Google's crawlers, this section [2] seems to be pretty conclusive
evidence to me that this Noindex thing is bullshit.

[0] [https://www.deepcrawl.com/blog/best-practice/robots-txt-noindex-the-best-kept-secret-in-seo/](https://www.deepcrawl.com/blog/best-practice/robots-txt-noindex-the-best-kept-secret-in-seo/)

[1] [https://www.stonetemple.com/does-google-respect-robots-txt-noindex-and-should-you-use-it/](https://www.stonetemple.com/does-google-respect-robots-txt-noindex-and-should-you-use-it/)

[2] [https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613326dd4dfc8c9b9a545e45/robots.cc#L262-L276](https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613326dd4dfc8c9b9a545e45/robots.cc#L262-L276)

~~~
jxcl
Google is also really generous with how they will let you spell "disallow":
[https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613326dd4dfc8c9b9a545e45/robots.cc#L691-L699](https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613326dd4dfc8c9b9a545e45/robots.cc#L691-L699)

:D
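
For flavor, here's a rough Python sketch of that kind of leniency; the
authoritative typo list lives in the linked robots.cc, this is just an
illustration:

    # Mirrors the prefix matching and typo tolerance of the linked
    # C++ lines in robots.cc.
    def key_is_disallow(key):
        frequent_typos = ("dissallow", "dissalow", "disalow",
                          "diasllow", "disallaw")
        return key.lower().startswith(("disallow",) + frequent_typos)

    print(key_is_disallow("Disallaw"))  # True
    print(key_is_disallow("Allow"))     # False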

~~~
knorker
I'm not surprised. Some people think _humans_ read robots.txt and get super
angry when the crawler doesn't understand.

~~~
c3534l
I read robots.txt, but I'm not a massive corporation.

------
wybiral
The interesting thing about robots.txt is that there really isn't a standard
for it. This [0] is the closest thing to one and almost every modern website
deviates from it.

For instance, it explicitly says "To exclude all files except one: This is
currently a bit awkward, as there is no "Allow" field."

And the behavior is so different between different parsers and website
implementations that, for instance, the default parser in Python can't even
successfully parse twitter.com's robots.txt file because of the newlines.

Most search engines obey it as a matter of principle but not all crawlers or
archivers [1] do.

It's a good example of missing standards in the wild.
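
A quick way to reproduce that kind of mismatch with the stdlib parser (the
URL and user agent here are only illustrative):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://twitter.com/robots.txt")
    rp.read()
    # If the parser trips over the file's formatting, the answer here
    # may not match what the site's authors intended.
    print(rp.can_fetch("MyCrawler", "https://twitter.com/search"))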

[0]
[https://www.robotstxt.org/robotstxt.html](https://www.robotstxt.org/robotstxt.html)

[1] [https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/](https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/)

~~~
pbowyer
> The interesting thing about robots.txt is that there really isn't a standard
> for it. This [0] is the closest thing to one and almost every modern website
> deviates from it.

That is changing, and was announced today:
[https://news.ycombinator.com/item?id=20326067](https://news.ycombinator.com/item?id=20326067)

~~~
wybiral
Yeah, my first reaction to Google heading yet another standard was to cringe,
but this is one of the situations where I think it makes a lot of sense.
They're dominant in the search industry, and most other engines tend to take
their cue from them, so having them spearhead it seems like a good move.

------
simonw
I absolutely understand why they did this, but I have to say I was
disappointed to see only 7 commits at
[https://github.com/google/robotstxt/commits/master](https://github.com/google/robotstxt/commits/master)
dating back to June 25th.

When I read "This library has been around for 20 years and it contains pieces
of code that were written in the 90's" my first thought was "that commit
history must be FASCINATING".

~~~
lucasmullens
I think it's pretty rare for a company to make internal commits public when
open-sourcing something.

~~~
deeteecee
yeah, otherwise you might see commit messages like "wtf is this shit" :)

------
douglasfshearer
> This library has been around for 20 years and it contains pieces of code
> that were written in the 90's.

Whilst I am sure there are good reasons for the omission, it would have been
interesting to see the entirety of the commit history for this library.

~~~
johannes1234321
From an archaeological perspective, very much so.

From Google's perspective, it's probably too much work. I would assume this
was part of the crawler code and was extracted into a library over time while
living in the monorepo, so changesets probably didn't touch only this code but
other parts as well, and the code probably depended on internal libraries (it
now depends on Google's public Abseil library). Publishing all of that would
need lots of review (also considering names and other personal information in
commit logs, TODO comments, and the like).

~~~
saagarjha
Not only that, code libraries that weren't designed to be open source often
have things in them that Google might not want to show: codenames, profanity,
calling out specific companies…

~~~
dragonwriter
Also, even if it is authoritatively managed in git _now_, the whole 20-year
history certainly wasn't (git is only 14 years old, and Google probably didn't
adopt it on day one), and it's quite likely the commit history wasn't
converted, so it's quite possible Google couldn't easily have made the whole
history available when publishing to GitHub even if they wanted to.

~~~
johannes1234321
I assume the authoritative version is still in Google's Piper-based repo, and
before that it was in Perforce for quite a while... so if there were interest,
Google could dig deep. But I assume there are other projects where this would
be even more interesting (how ranking changed over time; how storage formats
for the index changed; ...).

------
rasmi
Code here:
[https://github.com/google/robotstxt](https://github.com/google/robotstxt)

~~~
glenneroo
It's also the fifth link in the article, the one anchored on "open sourced".

------
jchw
Note that this is quite strict about which characters may be contained in a
bot's user agent. This is due to strictness in the REP standard.

[https://github.com/google/robotstxt/blob/master/robots_test.cc#L152](https://github.com/google/robotstxt/blob/master/robots_test.cc#L152)

    // A user-agent line is expected to contain only [a-zA-Z_-] characters and must
    // not be empty. See REP I-D section "The user-agent line".
    // https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1

So you may need to adjust your bot’s UA for proper matching.
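
For example, a quick sanity check of a bot's product token against that
character set (the regex and names are mine, purely for illustration):

    import re

    # Product token per the REP draft: ASCII letters, "_" and "-" only.
    PRODUCT_TOKEN = re.compile(r"^[a-zA-Z_-]+$")

    print(bool(PRODUCT_TOKEN.match("FooBot")))      # True
    print(bool(PRODUCT_TOKEN.match("FooBot/1.2")))  # False: slash and digits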

(Disclosure: I work at Google, though not on anything related to this.)

~~~
steventhedev
The strictness is in what may be listed in the robots.txt, not in the
User-Agent header as sent by bots. The example given in the linked draft
standard [0] makes it abundantly clear that it's on the bot to understand how
to interpret the corresponding lines of robots.txt.

Of course, in practice robots.txt files tend to look less like [1] and more
like [2].

[0]: [https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1](https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1)

[1]: [https://github.com/robots.txt](https://github.com/robots.txt)

[2]: [https://wpengine.com/robots.txt](https://wpengine.com/robots.txt)

~~~
jchw
Sorry, I mean for matching, and I did try to imply it was a limitation of the
standard and not the library. Though to avoid confusion, I do personally think
keeping the user agent minimal is wise, since users might have difficulty
guessing what value to use if it differs sufficiently from the real user agent
that's sent.

------
Causality1
I wonder how much noindex contributes to lax security practices like storing
sensitive user data on public pages and relying on not linking to the page to
keep it private. I wonder how much is in the gap between "should be indexed"
and "really ought to restrict access to authorized users only".

~~~
hitpointdrew
I am hoping not much, because that is beyond a horrible "security practice". I
have seen some lazy shit out there, but this would take the cake.

~~~
Causality1
If I recall correctly, there was a large company several years ago that tried
to prosecute a white hat who discovered that their user account pages included
the users' e-mail addresses, and that changing the address to that of a
different user would drop you right into that user's page with all their
personal information listed.

------
rococode
> how should they deal with robots.txt files that are hundreds of megabytes
> large?

What do huge robots.txt files like that contain? I tried a couple of domains
just now, and the longest one I could find was GitHub's,
[https://github.com/robots.txt](https://github.com/robots.txt), which is
only about 30 kilobytes.

~~~
jedberg
Sometimes they enumerate every page on the site, separately for different
crawlers.

Or they have a ton of auto-generated pages they don't want crawled and call
them out individually because they don't realize robots.txt supports globbing.
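
For instance, Google's matcher supports "*" and "$" wildcards, so a single
pattern (paths made up for illustration) can stand in for thousands of
enumerated lines:

    User-agent: *
    # One wildcard rule instead of listing every generated print page:
    Disallow: /products/*/print$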

~~~
greglindahl
Can you give an example in the wild?

~~~
jedberg
I was actually trying to find an example when I made my initial comment, but
was unable to. It's been a long time since I did web scraping. Since then
there are a lot more frameworks that help you build a website (and a
correspondingly sane robots.txt), so there may not be as many as before.

------
AceJohnny2
Fun & useless little bit of trivia: sci-fi author [1] Charles Stross (who
hangs around here) is the reason the first robots.txt was invented.

[http://www.antipope.org/charlie/blog-static/2009/06/how_i_got_here_in_the_end_part_3.html](http://www.antipope.org/charlie/blog-static/2009/06/how_i_got_here_in_the_end_part_3.html)

(Reminds me of how Y Combinator's co-founder Robert Morris has a bit of
youthful notoriety from a less innocent program.)

[1] and former code monkey from the dot-com era

------
orf
I guess lots of people misspell ~disalow~ disallow[1]

1\. [https://github.com/google/robotstxt/blob/master/robots.cc#L691](https://github.com/google/robotstxt/blob/master/robots.cc#L691)

~~~
badrequest
including yourself! Must be easy to do. :)
[https://www.merriam-webster.com/dictionary/disallow](https://www.merriam-webster.com/dictionary/disallow)

~~~
orf
Oh snap!

------
noir-york
I doubt there are any vulns in the code, seeing as its job for the last 20
years has been to parse input from the wild west that is the internet, and
survive.

But I'm sure someone out there will fuzz it...

~~~
dev_dull
I’d be surprised if Google isn’t fuzzing it with their (also open-sourced)
fuzzing tool.

~~~
jpswade
I'd never heard of this, so I looked into it:

[https://opensource.googleblog.com/2019/02/open-sourcing-clusterfuzz.html](https://opensource.googleblog.com/2019/02/open-sourcing-clusterfuzz.html)

Fascinating.

------
pedrorijo91
Can this be seen as an initiative to make Google's robots.txt parser the
internet standard? Every webmaster will want to be compliant with Google's
corner cases...

~~~
H8crilA
That's probably the hidden agenda.

~~~
djsavvy
It's not even hidden, Google explicitly says that in the blog post.

------
jhabdas
Anyone else witnessed this behavior?
[https://stackoverflow.com/questions/4769140/robots-txt-user-agent-googlebot-disallow-google-still-indexing/52732538#52732538](https://stackoverflow.com/questions/4769140/robots-txt-user-agent-googlebot-disallow-google-still-indexing/52732538#52732538)

~~~
TomAnthony
There is a difference between robots.txt blocking a page and noindexing a
page.

Blocking in robots.txt will stop Googlebot from downloading that page and
looking at the contents, but the page may still make it into the index on the
basis of links to that page making it seem relevant (it will appear in the
search results without a description snippet, with a note about why).

To have a page not appear in the index, you need to use a 'noindex' directive
[1], either in the page itself or in the HTTP headers. Note, however, that if
the page is blocked in robots.txt then Google cannot read that noindex
directive.
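
Concretely, the two documented forms are a meta tag in the page's HTML head:

    <meta name="robots" content="noindex">

or an HTTP response header:

    X-Robots-Tag: noindex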

Also, in the StackOverflow response you linked to, the user agent is listed
just as 'Google', but it should be 'Googlebot', as per the 'User agent token
(product token)' column of the table in [2].

Good luck! :)

[1]
[https://support.google.com/webmasters/answer/93710?hl=en](https://support.google.com/webmasters/answer/93710?hl=en)
[2]
[https://support.google.com/webmasters/answer/1061943](https://support.google.com/webmasters/answer/1061943)

------
nn3
That's actually nice, straightforward, and relatively simple. I had expected
something over-engineered, with at least parts of the code dedicated to
demonstrating how much smarter the author is than you. But it's not. Just a
simple parser.

~~~
cmrdporcupine
Looks like standard Google C++ coding style to me.

Honestly, excessive cleverness does not generally pass code review @ Google.
Especially something that would get this many eyes.

~~~
joshuamorton
Honestly, I'm most surprised they haven't replaced all the C-isms already.
Seeing raw char pointers and strbrk is... weird.

~~~
cmrdporcupine
Yes, as a Googler, I'd probably flag that in review.

But it being old and critical, I'd also be wary of major changes.

------
jaredcwhite
Seems strange to get excited about a robots.txt parser, but I feel oddly
elated that Google decided to open source this. Would it be too much to hope
that additional modules related to Search get released in the future? Google
seems all too happy to play the "open" card except where it directly impacts
their core business, so this is a good step in the right direction.

------
goddtriffin
Looking forward to the robots.txt linters created as wrappers around this
(especially for VSCode).

------
danielovichdk
I find it really cool that the code for this is so simple and clean.

------
unchic
I don't understand the entire architecture behind search engines, but this
seems like a pretty decent chunk of it.

What are the chances that Google is releasing this as a preemptive response to
the likely impending antitrust action against them? It would allow them to
respond to those allegations with something like, "All the technology we used
to build a good search engine is out there. We can't help it if we're the most
popular." (And they could say the same about most of their services: Gmail,
Drive, etc.)

------
Tepix
So, is it premature to expect a Go package by Google as well?

There's already
[https://github.com/temoto/robotstxt](https://github.com/temoto/robotstxt)

~~~
jerf
This is the sort of code you write a binding to and call it a day, since the
entire point is to absolutely precisely match the behavior of this code, which
is basically a specification-by-code. You can never be sure a
re-implementation would be absolutely precisely the same in behavior, so it's
not worth doing.

~~~
helper
The C++ implementation is <1000 lines. It doesn't seem like a correct port
would be particularly difficult, especially with a reasonably large test
corpus.

~~~
jerf
Famous last words.

I mean, I get it; it feels that way to me intuitively too. But I'd still
recommend against trying it, because I've learned the hard way that the
intuition here is, if not _wrong_, at the very least _very badly
underestimating the cost_, especially in the "unknown unknown" department.

~~~
zzzcpan
Do you even need to match Google's robots.txt parsing behavior? At less than
1000 lines, you can be pretty sure they are not doing it right and are
breaking plenty of people's assumptions about it. Either way, you have to test
it on real-world data.

~~~
jerf
The point of this code release seems to be to release Google's precise logic.
That you may incorporate it into something else is, IMHO, less interesting;
we've got plenty of other solutions that "do robots.txt" well enough. If it
was just about that, Google's release of this would not be worth anything. The
point is so that non-Google parties can see exactly what Google is seeing in
your robots.txt.

That's why I'm saying there's no point trying to re-implement this. If you
were going to re-implement this, there's probably already a library that will
work well enough for you. The value here is solely in being _exactly_ what
Google uses; anything that is a "re-implementation" of this code but isn't
_exactly_ what Google uses is missing the point.

If they formalize it into a spec, others may then implement the spec, but they
can and should do that by _implementing the spec_ , not porting this code.

~~~
zzzcpan
As I understand it, the point of the Go comment is to parse actual real-world
robots.txt files, for which you don't need to behave exactly as this library
does.

------
Jahak
Cool, thanks

------
sandGorgon
Is Golang significantly slower than C++? I thought Google had invented Golang
precisely for this kind of internal code.

I had thought most of the systems code inside Google would be Golang by now.
Is that not the case? The code doesn't look too big; I don't think porting is
the big issue.

~~~
dgellow
> Is Golang significantly slower than c++ ?

Depends on the context, but in general, yes. C++ is very close to C in this
respect, trading memory safety for performance.

Concerning Google, as far as I know the codebase is mostly C++, Java, and
Python. Go will surely eat a bit of the Java and Python projects, but it's
unlikely C++ will be replaced any time soon.

~~~
penagwin
> Is Golang significantly slower than c++ ?

> Depends the context, but in general, yes.

I don't believe this is the case. Most optimized, natively compiled languages
all perform similarly: Go, C, C++, Rust, Nim, etc. I'm sure there are edge
cases where this isn't true, but they all perform roughly the same.

The performance rift only starts when you introduce some form of VM and/or use
an interpreted language. Even then, under certain workloads their
optimizations can put them close to their native counterparts, but otherwise
they are generally slower.

The real reason Google didn't re-write this in Go is likely because the
library is already finished, it works, a re-write would require more extensive
testing, etc. Why spend precious man-hours on a needless re-write?

~~~
spullara
Golang and Java have similar performance characteristics. I would not put them
in the same class as C/C++/Rust.

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/go.html](https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/go.html)

~~~
recursive
Presumably less startup time and JIT.

~~~
spullara
Depends on if the Java code is ahead of time compiled or not. The AOT compiler
is included in the latest JDKs.

~~~
igouy
_fwiw_ AoT with Substrate VM

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/java-substratevm.html](https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/java-substratevm.html)

~~~
spullara
Nice, good to know the JIT is doing something. Will be interesting to track
the performance of this over time.

~~~
igouy
And that's _graalvm-ce_ not the enterprise edition.

------
noncoml
“Because the REP was only a de-facto standard for the past 25 years, different
implementers implement parsing of robots.txt slightly differently, leading to
confusion. This project aims to fix that by releasing the parser that Google
uses.”

The amount of arrogance in this sentence is insane.

Because Google's way is the one true way?

~~~
gbear605
In terms of “what should a robots.txt file look like to be parsed correctly,”
yes, because they’re the ones who are going to be doing most of that parsing.
Yes, ideally it would be an entirely independent standardization process, but
it’s not arrogant of them.

------
harry8
Never before has a company stood on such a mountain of open source code, made
so much money with it, and contributed _so little_.

No, really. Microsoft? The BSD TCP/IP stack maybe saved them for Win95, but
there was Trumpet Winsock, and they probably would have survived to write
their own for the next release.

Google doesn't get off the ground and has literally no products and no
services without the GPL code that they fork, provide remote access to a
process running their fork, and contribute nothing back to. It's a good end
run around the spirit of the GPL, and it has made them a fortune (they have
many fortunes; that's just one of them).

New projects from Google? They're only open source if Google really needs them
to be, like Go, which would get nowhere if it weren't, and it would be very
expensive for Google to have to train all their engineers rather than pushing
that cost back on their employees.

At least they don't go in for software patents, right? Oh, wait...

At least they have a motto of "Don't be evil", which we pretty much all have
personally, but it's great that a corporation backs it. Corporate
restructurings happen, sure... oh wait, the motto is now gone. "Do the right
thing"? Well, this is fine, and Google does it, for all values of "right" that
equal "profitable to Google and career-enhancing for senior execs".

But this is great: a robots.txt parser that's open source. Someone other than
Google could do something useful for the web with it, like writing a
validator, because Google won't. Seemingly that's not their definition of "do
the right thing."

"Better than Facebook, better than Facebook; any criticism of Google is by
people who don't like Google, so it's invalid." Only with more words. Or none,
just one button. Go.

~~~
supernomad
So you aren't wrong that Google is built on the shoulders of giants, but I
will point out that every single company today running its SaaS offering on
top of Linux/BSD is doing the exact same thing.

The only reason Linux is as mainstream as it is today is exactly this freedom
to leverage the code. You even point out that Golang's success comes from
precisely the same freedom. Overall, open source isn't about making money; it
has never been about making money. It's been about making an impact, and
bettering the world around us all, by giving a piece of technology to be
freely used by everyone. There are a variety of open source licenses that
can and will protect your code from any and all closed-source uses. For
example, the AGPL explicitly states that if your application so much as
interacts with the code over a TCP connection, or even a single UDP packet, it
must be open source as well. Yet you will rarely see libraries or applications
using this license. Why, you might ask? The answer is simple: it reduces the
impact that code can have.

Really, at the end of the day, it comes down to a choice for the developer(s):
do you want to make money, i.e. go the Microsoft/Apple route? Or do you want
to make an impact, i.e. go the Linux/BSD route?

Let me ask one final question: which of the above operating systems do you
think are more widely used, and which have changed the world in a more
dramatic manner?

~~~
harry8
I couldn't care less about other companies that have existed for 5 minutes in
the SaaS space; my comment was that nobody has ever derived more value and,
given that, contributed less back.

Google is built on an end run around the spirit and intent of the GPL. "Don't
distribute software, distribute thin client access to it! No GPL! Hurrah!
Money!"

Decide for yourself what you think of that but it happened. Without it, no
google.

But hey, list anyone you think derived more value and contributed less back.
It's a reasonable thing to do. It doesn't affect the criticism of Google.

