
Googlebot’s JavaScript random() function is deterministic - TomAnthony
http://www.tomanthony.co.uk/blog/googlebot-javascript-random/
======
geocar
I wrote about something like this previously[1]

Ads often have something like this attached to the end:

    load("https://adserver/track.gif?{stuff}&_=" + Math.random())

My ad server collected multiple tracking pixels for various events, and after
five pixels I could fingerprint the browser (Firefox, Chrome, MSIE, etc.) and
identify someone fiddling with the user-agent string, or using a proxy server
to mask this information.
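To illustrate the raw material this yields: the cache-buster values accumulate in the server's request logs, and grouping them per client gives the handful of Math.random() outputs needed to test against each browser engine's PRNG. The log-record shape below is hypothetical, not the actual ad-server code:

```javascript
// Illustration only: collect the `_=` cache-busters that a pixel like
// the one above leaves in server logs, grouped per client. After ~5
// values per client, they can be checked against the PRNG algorithms
// of each browser engine. The log-record fields here are hypothetical.
function collectCacheBusters(logRecords) {
  const perClient = new Map();
  for (const { ip, url } of logRecords) {
    // Pull the `_=` parameter out of the request URL (plain decimal
    // output only; this sketch ignores exponent-form values).
    const match = url.match(/[?&]_=([0-9.]+)/);
    if (!match) continue;
    const values = perClient.get(ip) || [];
    values.push(Number(match[1]));
    perClient.set(ip, values);
  }
  return perClient;
}
```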

[1]: [http://geocar.sdf1.org/browser-verification.html](http://geocar.sdf1.org/browser-verification.html)

~~~
na85
How would a person defend against this fingerprinting?

~~~
geocar
Buy an iPhone.

• Did proper random before anybody else

• Active countermeasures against cookie-based retargeting

• Popular enough market that merely having an iPhone in a geographic area
doesn't single you out

There's a guy who downloads every page with curl. I see him in my web logs. I
think he must have some script that parses the AMP pages out and does
something with them, but because he's the only person in that geographic
region who browses the web with curl, he's very easy to spot from a tracking
perspective. On the other hand, because he's using curl, I don't think anyone
wants to bother trying to show him an ad.

~~~
ccozan
That would be Stallman.

~~~
rkeene2
Stallman would probably use GNU wget, which is licensed under the terms of the
GPL, while curl is licensed under a license derived from the MIT/X Consortium
license.

~~~
M_Bakhtiari
I've never heard him say anything bad about running MIT licensed software.

On the other hand, I think I've read a FAQ where he says he views web pages by
having a script fetch them with wget and then email them to him, but I'm
pretty sure that's simply because wget is free and does the job (being able to
recursively download the necessary resources), not because it's GPL licensed.

~~~
kbenson
Obviously Stallman is modifying his wget user agent to show curl just to
prevent fingerprinting him accurately... ;)

------
jchw
Seems reasonable to me. My guess is that it's not performance, but rather
predictability, that matters here. Being able to detect when a page
meaningfully changes is probably useful for Google, and a good implementation
of Math.random() would potentially thwart that. Especially seeing how many
pages have the magic constant in them...

It's also probably useful for determining whether two pages are the same,
which may be needed to keep the crawler from crawling a million paths into an
SPA that don't actually exist, for example.

~~~
taf2
I wonder if they also reimplement or adjust the Crypto API, since it offers
better random numbers.

~~~
jchw
If I had to guess, I'd say Googlebot simply disables the API. It's new enough
that this would be reasonable, and executing real crypto in the context of
Googlebot is probably rarely desirable.

------
sevensor
Seems like the expectations about PRNGs being expressed in this thread are a
bit unrealistic. They're always deterministic. The fact that this PRNG is also
always seeded the same makes it easier to fingerprint, but that has no bearing
on whether the PRNG is deterministic.

~~~
yndoendo
Random should be nondeterministic, while PRNGs should be reserved for
cryptography. The two serve different purposes.

~~~
sevensor
That would be terrible! I rely on the deterministic behavior of PRNGs all the
time. For instance, I often generate random test vectors. If I have a failure,
I want it to be reproducible so I can fix it. And it is, as long as I supply
the same seed.
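The pattern looks something like this (a minimal sketch using mulberry32, a small seedable PRNG, since Math.random() itself takes no seed; the seed values are arbitrary):

```javascript
// A tiny seedable PRNG (mulberry32) standing in for Math.random().
// Same seed, same sequence: a failing randomized test can be replayed
// exactly by recording the seed.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Reproducible "random" test vector: n floats in [0, 1).
function testVector(seed, n) {
  const rand = mulberry32(seed);
  return Array.from({ length: n }, () => rand());
}
```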

~~~
yndoendo
So you found another use for PRNGs. Random should be close to real world
multi-side dice with seed entropy used from multiple sources such as video
card, nic, and storage buffers.

PRNG != Random.

~~~
sevensor
There are lots more examples: heuristic optimization, discrete event
simulation, sampling... I could go on. Deterministic RNGs are much better in
all of these applications, where reproducibility of results is important. I'm
sure nondeterministic RNGs have important uses too. Perhaps you'd care to
describe some of them.

~~~
vlovich123
I know you're being facetious, and the OP isn't correct that deterministic
PRNGs aren't useful, but any cryptographic application of a PRNG should be
non-deterministic.

~~~
sevensor
Well, crypto isn't my specialty, so I always tread lightly when commenting
about it. My hand-wavey understanding is that CSPRNGs are deterministic but
not predictable, and that actual entropy is used to seed them. Going even
further out on a limb, I think this is supposed to be the difference between
/dev/urandom (CSPRNG) and /dev/random (actual entropy). If I have that wrong,
I'd appreciate correction by somebody in the know.

~~~
vlovich123
You're correct that CSPRNGs themselves are deterministic. I probably just
misread what you were saying. As for /dev/urandom vs /dev/random, that's not
really true. On Linux there's kind of a historical artifact of why they're
different (a blocking vs. non-blocking API), but on OSX /dev/urandom is a
symlink to /dev/random.

------
lolc
Maybe Googlebot forks its JS-engine from a pre-initialized image. That would
explain the unchanging seed.

~~~
simias
Regardless of other technical limitations I'm guessing that it's actually done
on purpose, as point #3 in TFA states:

>Predictable – Googlebot can trust a page will render the same on each visit

It's probably important for Google's crawler to identify whether a page
changed or not; if some elements in a page are randomly generated, they may
want to limit the impact.

I mean, after all they seem to use a real, changing value for their Date, so
if they wanted they could just seed their RNG with that.

------
endymi0n
This has been known for a while and used to cause huge problems for our
tracking:
[https://github.com/snowplow/snowplow-javascript-tracker/issues/499](https://github.com/snowplow/snowplow-javascript-tracker/issues/499)

~~~
lotyrin
Under the assumption that the same event fired from the same IP at the same
time, with the same environment, etc. would be considered a duplicate in your
system, I'd have designed this system to be idempotent-insert-only and use
content-addressing instead of nonces for identity (event ID = suitably large
hash of event data to avoid collision). If that assumption doesn't hold, then
add your nonce to the event data (and thereby modify the hash).

------
amelius
Nice to see fingerprinting used _against_ Google (instead of _by_ Google). But
this is easy to fix, so I don't expect this to work for much longer.

~~~
jrochkind1
I think how easy it is for Google to make detecting Googlebot this way harder
depends on why Googlebot is doing it in the first place, which we don't really
know. If it's done for performance reasons, or for predictability (so
rendering the same page twice is guaranteed, or at least more likely, to
produce the same result), it might be difficult to change without cost.

But I believe Googlebot always faithfully sends its user-agent. Is there a
reason Google would care about 'fixing' this to make Googlebot harder to
detect via random() predictability, when you can always just detect it via the
user-agent anyway? I'm not sure; curious if others have thoughts!

~~~
Kesty
Google has checks in place to see if someone is serving things to Googlebot
differently than to the rest of their users. So it almost definitely has bots
that double-check pages without the Googlebot user-agent.

If the "disguised" Googlebot is the same as the actual one (chances are it is,
since Google would want it to be as close as possible so as not to flag false
positives) and it uses the same seed for consistency, then you might be able
to use that to avoid detection when serving Google something different than
normal users.

Newspapers used to do that (some still do) to have their full article content
indexed while serving a paywall to everyone else.

------
kmbriedis
Every random() function is kind of deterministic, but this is still a very
interesting discovery!

~~~
skrebbel
Totally down nitpick alley, yeah, but: not if it's fed by an external source
of randomness (e.g. cosmic rays).

~~~
philbarr
> (e.g. cosmic rays).

YouTube copyright violations?

The chance of getting a response from Google support?

~~~
skrebbel
I love those. A bit easy to game if you have the right job at Google, but
still this is cool.

I'd also assume that the Twitter firehose could be a great source of
randomness.

------
nukeop
What's a plausible real world use case for this if one wanted to exploit this
to game SEO? Can it even be exploited in any way?

~~~
TomAnthony
OP here. I made a silly function that identifies Googlebot, but has 'plausible
deniability' built in:

[http://www.tomanthony.co.uk/fun/googlebot_puzzle.html](http://www.tomanthony.co.uk/fun/googlebot_puzzle.html)

The idea is that you could use such a function to always send Googlebot down
one execution path and users down another, and make it look like you are
doing an A/B test on a small slice of traffic. You could then do something
nefarious, such as adding spammy content to the page for Googlebot.
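A sketch of the shape such a function could take (illustrative only: the 0.5 threshold and the injectable `rand` parameter are placeholders, not the real Googlebot values or the actual puzzle-page code):

```javascript
// Hypothetical sketch. An innocent-looking "A/B test" bucket: real
// visitors split roughly 50/50, but a client whose Math.random()
// sequence is fixed (like Googlebot's) lands in the same bucket on
// every visit. `rand` is injectable here so the sketch is testable.
function abBucket(rand = Math.random) {
  return rand() < 0.5 ? 'A' : 'B';
}

// A page could then route bucket 'A' to one template and bucket 'B'
// to another, with a straight face about it being an experiment.
```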

However, in reality it is not likely to be a viable tactic for any decent
quality site, and unlikely to have much of an impact on lower quality sites
that may be willing to risk it.

It is an interesting idea though, and we research this sort of thing so that
we can better identify such behaviours in competitors of our clients.

~~~
codedokode
You can detect Googlebot by IP address or User-Agent which is much easier.

~~~
TomAnthony
Yes you can - but it lacks the 'plausible deniability'. Especially if you have
it in your frontend code. :)

~~~
realPubkey
After doing SEO for 10 years, I assure you that plausible deniability will not
give you an advantage. Google will punish your site and won't listen to your
arguments.

------
herodotus
For those (like me) who are not that familiar with JavaScript: the JavaScript
spec for Math.random says "....The implementation selects the initial seed to
the random number generation algorithm; it cannot be chosen or reset by the
user." Furthermore, implementations normally pick a fresh seed each time. It
seems that Google has modified their JavaScript engine, perhaps by adding an
explicit Math.Seed function.

------
some1else
I employed a seed-based deterministic random function in a WebWorker once, to
noisily but predictably drive an animation. I suppose one could use the same
approach to have a decent non-deterministic source alongside Google's patched
seedrandom.

------
onion2k
I guess they used the XKCD method of random number generation:
[https://xkcd.com/221/](https://xkcd.com/221/)

It is actually truly random if you have a fair die.

~~~
dagurp
Or Dilbert
[http://dilbert.com/strip/2001-10-25](http://dilbert.com/strip/2001-10-25)

~~~
1_800_UNICORN
I still remember this comic strip 16 years after I saw it in the paper.

------
yueq
    def roll(): return 4

~~~
nimell
Python version of: [https://xkcd.com/221/](https://xkcd.com/221/) ?

------
est
This makes me wonder if Chrome Headless has enough entropy for random() if
running on a Linux server.

~~~
meirelles
Entropy is not the problem, surely! The choice to return deterministic values
from random() is a feature, not a bug. When you are crawling pages looking for
meaningful content changes and new links, you probably don't want your
random() creating noise.

------
partycoder
PRNGs are deterministic. Even the generator available in processors through
the RDRAND instruction is deterministic (a DRBG reseeded from an on-chip
entropy source).

Unless you are using some form of real entropy, e.g. dedicated hardware or
RDSEED, that will be the case.

~~~
Kesty
The article doesn't simply say it's deterministic; it says the seed doesn't change.

------
h000per
The Google queries for "roll a dice" and "flip a coin" aren't actually random
either. They seem to be based on the current time.

~~~
anc84
Could you add a lot more detail rather than just a blanket statement like this?

