
How I reverse-engineered Google Docs to play back any document's keystrokes - jsomers
http://features.jsomers.net/how-i-reverse-engineered-google-docs/
======
6stringmerc
Clever, very clever.

As somebody who has worked full-time, over-time, and essentially in my sleep
with Word files, PowerPoints, and highly sensitive bid documentation...yeah, I
have to agree working in the cloud for this kind of stuff strikes me as career
suicide. Maybe not in your turf, re: software development, but with respect to
management, operations and marketing people, there should only be one person
with the key to the kingdom. I'm not kidding about this, even if just talking
internal development.

Also, this is why everything went out as a locked down PDF, unless explicitly
mandated otherwise by RFP/etc specifications...and even then, Track Changes >
Accept All Changes is gospel. Anybody in my line of work saw what the .GOV did
with converting PDFs and simply redacting with a graphic over the text...yeah,
that's why I'm a first-class proposal developer, because I've seen carnage yo.

~~~
yzzxy
Even if you are working on something that is not security/negotiation
sensitive, this is still scary, and the layman can approximate it with Google
docs document history and the like.

Document history enables thoughtcrime detection, especially as it becomes more
and more atomic towards actual keystrokes. I wonder how long it will be until
typed and deleted text is used as evidence in a court of law.

~~~
zaroth
As I recall the Steve Jobs / Apple wage collusion lawsuit used Steve's auto-
saved draft emails to show intent.

------
morgante
I don't understand why most of the commenters here are focusing on the privacy
implications rather than the technical aspects.

Is this really a privacy breach? It's been _obvious_ that Google stores
revision history since it launched—you've always been able to access a
thorough revision history in the UI itself...

~~~
eplanit
I think the issue for some (including myself) is that this revelation shows
that the vulnerability of a compromised document is greater than the apparent
contents of the document. It includes all keystrokes, which could expose other
ideas the writer might have had (but discarded), etc. This fact makes me even
less inclined to use Google docs.

~~~
morgante
Right, but why did it take a playback feature for you to realize this?

Google Docs has always had a revision history tab.

~~~
nitrogen
The parent comment was clearly aware of the versioning feature, and appears to
be more surprised by the keypress-level granularity of the versioning that may
not have been apparent in the official UI.

------
character0
The creation story for this is really neat. This could be an amazing tool in
the school setting, especially for people that teach in university writing
centers. This isn't just asking an author for a peek behind the curtain,
asking a few questions about what they were thinking at the time of writing,
this is Breaking the Magician's Code level stuff!

I am curious to see who is (brave enough?) to show their writing process in
all its glory.

~~~
6stringmerc
I'd be happy to upload scans of one of my more recent short stories, "Pink
Paint Rain," simply to show how much true effort goes into massaging lines and
language a la code.

Finished version: [http://www.scribd.com/doc/156040925/Pink-Paint-Rain-
Vernon-W...](http://www.scribd.com/doc/156040925/Pink-Paint-Rain-Vernon-
Walter-July-2013)

I have about 7-10 print outs of the story with mark ups. Before switching to
English & Creative Writing I was in Computer Science and pretty skilled with
C++ at the time, so I can come up with a decent correlative:

Every version I printed was like running it through a compiler - this is to
evidence that even if a piece of code or a line of text is functional, is it
efficient, and, if at all possible, an excellent construction? These are
subjective concepts that are innate in language. There may be several writers
who can put words to paper directly from their head with no revision process -
hell, it's what I do when I'm "practicing" on my IBM Selectric III to commit
to writing in permanence (think before I type) - but for the greats, it has
always been an iterative process.

To keep this from being all about me, allow me to provide a link to something
that may be to your liking:

[http://www.amazon.com/James-Micheners-Writers-Handbook-
Explo...](http://www.amazon.com/James-Micheners-Writers-Handbook-
Explorations/dp/0517197138)

~~~
sqrt17
Hi, I'm a computational linguist and I would find it really great if some
people could share their traces of their typing/editing process.

In programming, we have editors that have strong support for writing because
we know exactly what the semantics of code is, and what good operations for
editing/refactoring are. With writing prose, the best we currently have is
edit histories from Wikipedia articles, which are on a much larger timescale
and full of things that should not be part of the editing process (vandalism,
NPOV wars, etc.)

------
michaelx386
I broke out into a cold sweat watching this as I remembered all the times
I've inadvertently pasted sensitive stuff into a document. It's still very
cool though, I'll just need to remember to be careful when sharing documents.

~~~
r00fus
I wonder how much sensitive information is inadvertently pasted into a browser
location bar or autofill text box that's silently captured by web apps like
Google Docs?

I know I've accidentally done the "paste password" into those places at
times.

~~~
lambda
I believe (can't recall the source at the moment) that on Google computers,
they actually watch all of your input for password input, and if you enter
your password somewhere other than the official Google single-sign on
interface, will make you rotate your password. They're pretty serious about
not letting you type your password anywhere other than where you're supposed
to.

~~~
unfamiliar
That sounds like a pretty big security hole. Just "type" random letters until
you get a warning saying not to enter your password outside of password
fields.

~~~
divegeek
There's no such warning displayed, because that would be a security hole.

This password security measure is a Chrome extension that's required by
company policy to be installed on all corporate machines. It watches all input
(to browser forms) and if it detects your password being typed anywhere other
than an actual sign-on page, then the next time you sign on successfully
you're required to change your password. I believe there's also an e-mail
notification, but it's delayed.

This is actually a pretty good password security technique, specifically
because people often inadvertently type their password into the wrong forms
due to focus errors, lack of caffeine, etc.

~~~
TheLoneWolfling
How would you do that though?

Because I can't think of an efficient way to do that that doesn't involve
having the extension have access to the password.

I mean, you could store the password hash + length, but then you're securely
hashing every single overlapping substring of what you enter, which is not
exactly fast. Especially as KDFs are designed to be slow.

And if you store the password hash then you enable an offline attack.
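
To make the cost concrete, here is a minimal sketch of the "hash + length" approach described above. It is not how any real extension is known to work; the function names, the PBKDF2 choice, and the iteration count are all assumptions for illustration. Every window of the stored length costs one full KDF call, which is the slowness being pointed out, and the stored record is exactly what an offline attacker would want.

```python
import hashlib
import hmac
import os

# Deliberately low for the demo; a real KDF setting would be far slower.
ITERATIONS = 1_000


def make_record(password: str, salt: bytes) -> tuple[bytes, int]:
    """Return (hash, length): the only data this hypothetical checker keeps."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return digest, len(password)


def typed_password_detected(typed: str, salt: bytes, record: tuple[bytes, int]) -> bool:
    """Hash every window of the stored length and compare it to the stored hash."""
    digest, length = record
    for start in range(len(typed) - length + 1):
        window = typed[start:start + length]
        candidate = hashlib.pbkdf2_hmac("sha256", window.encode(), salt, ITERATIONS)
        if hmac.compare_digest(candidate, digest):
            return True
    return False


salt = os.urandom(16)
record = make_record("hunter2!", salt)
print(typed_password_detected("dear bob, hunter2! oops", salt, record))
print(typed_password_detected("nothing secret here", salt, record))
```

For n typed characters and a password of length k, this makes n - k + 1 KDF calls, so a properly slow KDF makes real-time checking expensive.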

~~~
brazzledazzle
It wouldn't have to do it in real time right? It could easily batch typing
sessions and have the server chew through them asynchronously.

~~~
TheLoneWolfling
So then you're sending every keystroke people make to a central server?

Even assuming that the connection is secure (never a good assumption), that
still means that there is a single point of failure. And one with drastic
consequences.

~~~
brazzledazzle
But doesn't any service that you authenticate against assume the channel is
secure? Presumably this would use SSL.

I do agree about the single point of attack though. Perhaps you could do an
asynchronous substring check locally when the CPU is idle.

~~~
TheLoneWolfling
But then anyone who can gain access to the computer once can then perform an
offline attack on the password at their leisure.

~~~
brazzledazzle
True, but anyone that can gain privileged access to the computer is already
king of the castle. Why attack it offline when you can just keylog it? I think
it goes back to being one part of an overall security posture. Encrypt your
workstations and people can't just pick them up and own them.

------
kens
I find it fascinating to see how much deleting and rewriting the author did on
the first two sentences of his Atlantic article. You can see the idea getting
rewritten in many ways.

Is this a typical way to write a magazine article? I wouldn't have expected so
much time revising the opening sentences before getting the rest of the
article in place. (But there's probably a lot of variation between writers.)

~~~
brk
I've written a fair number of magazine articles, and also a lot of white
papers and other documents.

For me, I usually spend a day or two just thinking about it in my head. Going
over what would be a logical thought flow and things like that. When I sit
down to actually _write_, I tend to have very few revisions. My first draft
is much closer to the typical person's 5th draft (I think), but that's because
I've been revising and editing in my head first.

~~~
fillskills
Is this a practised habit? I mean, when you started out, did you do more
revisions on paper and less in your head?

~~~
brk
Yes, it's a practiced habit.

I found that when I first started writing regularly I would spend a lot of
time doing constant editing similar to the example shown in the linked
article. This would become distracting and time consuming, then I'd forget
other things I had wanted to say. So I found it better, for me, to just kind
of write things in my head first and then sit down and write, almost more
transcribing vs. "writing".

~~~
fillskills
Thanks, that's really helpful for me. I am just starting to learn to write blog
articles for my business and I certainly do a LOT of editing. Glad to see
someone progress to where I want to be at... one day

~~~
e12e
If you haven't [read it] already, I strongly recommend "On Writing Well" by
William Zinsser.

[edit: unintentionally demonstrating the need for more than one revision...]

------
forca
This is a very good reason to never use software in the so-called "cloud". I
also remember years ago when someone showed me "Track Changes" in MS Word and
other programs and how you could go back and look at, say, a bid offer and see
if everything was on the up and up. You could see, esp. if the document was a
form letter or canned response, to which other companies were offered
different terms, you name it.

I dislike revision-able software for a number of reasons. Privacy is the
foremost reason. Yes, yes, "if you've nothing to hide, you've nothing to
fear..." That old chestnut gets trotted out every time someone worries about
security or privacy.

Since about 2000, I keep my documents in plain text only on an encrypted drive
backed up several times over -- none of the backups are online, but I'm still
good if my house burns down, my machines get stolen, you name it.

No, just no.

~~~
sbarre
You can also take advantage of all the great bonuses of revisions and then
before sending the document just copy/paste the content into a new document
that doesn't include the revisions. That seems sensible too.

~~~
bostonpete
I'm pretty sure you can just "Accept all changes" before saving the document
too and the change history will be cleared.

~~~
e12e
Does sound like a very fragile workflow (there's no reasonable way to tell a
full-history doc from a "publish grade" doc by glancing at the file on the
filesystem).

Keeping everything in proper version control (possibly unzipped, to give
usable diffs even for office document formats -- or in something like
markdown) -- would at least raise the bar a bit -- there'd be a different process
for sending a single version of a file, and sending all versions of (all) [a]
file(s).

I suppose if you're already running an internal mail server, you could just do
filtering there, making sure no version/history-rich documents pass out that
way...

------
bjoe_lewis
For more information about the grammar/schema for document operations, which
is actually what is transmitted in the /save call of gdocs,

[1] [http://wave-
protocol.googlecode.com/hg/spec/federation/waves...](http://wave-
protocol.googlecode.com/hg/spec/federation/wavespec.html#anchor39)

[2] [http://code.google.com/p/wave-protocol/](http://code.google.com/p/wave-protocol/)
: the wave protocol project (initiated by Google, now maintained by Apache) is
the root from where gdocs adopted OT.

------
marknadal
Great article, fascinating.

The author mentions that his system doesn't handle rich text, which is fine,
but I'd just like to comment on how difficult of a problem handling rich text
is. If anyone is interested in having a personal text-only replay editor,
check out [http://sharejs.org/](http://sharejs.org/) by an ex-Google Wave
engineer.

As far as handling rich text, I've talked to the original co-founders of
Writely (which became Google Docs), and I've also spent a good 8+ months on it
as well. There are lots of tradeoffs involved that diff-match-patch (as
mentioned in the article) won't work on. Docs ultimately expresses styles as
applied ranges, rather than actual markup.

Point being, Google keeping every keystroke you've made is absolutely
necessary for realtime collaborative writing.
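
To illustrate the "styles as applied ranges" idea mentioned above, here is a rough sketch, not Docs' actual data model: the `StyleRange` type and `insert_text` helper are hypothetical names. The text stays plain, and each style is a separate (start, end) span that has to be shifted or stretched whenever text is inserted.

```python
from dataclasses import dataclass


@dataclass
class StyleRange:
    start: int   # inclusive index into the plain text
    end: int     # exclusive index
    style: str   # e.g. "bold"


def insert_text(text: str, ranges: list[StyleRange], pos: int, new: str):
    """Insert plain text and shift or stretch the style ranges around it."""
    n = len(new)
    shifted = []
    for r in ranges:
        # A range starting at or after the insert point moves right as a whole.
        start = r.start + n if r.start >= pos else r.start
        # A range ending after the insert point grows (or moves) by the same amount.
        end = r.end + n if r.end > pos else r.end
        shifted.append(StyleRange(start, end, r.style))
    return text[:pos] + new + text[pos:], shifted


text, ranges = "hello world", [StyleRange(6, 11, "bold")]  # "world" is bold
text, ranges = insert_text(text, ranges, 0, "big ")
print(text, ranges, text[ranges[0].start:ranges[0].end])
```

Even this toy version has to pick a policy for inserts exactly at a range boundary (here they stay unstyled), which hints at the tradeoffs described above.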

~~~
morgante
Yup, collaborative rich text editing is a surprisingly thorny problem.

ShareJS and Quill have actually been making some great progress on this
though:
[https://github.com/share/ShareJS/issues/1](https://github.com/share/ShareJS/issues/1)

~~~
marknadal
As you might have seen... I am "amark" in the beginning of the thread. Wow, it
is still going, I'm gonna have to read all the updates!

------
jrochkind1
Lawyers doing discovery will definitely want to know about this.

So will document retention specialists trying to foil lawyers doing discovery.

So will hackers looking for sensitive information, and security specialists
looking to avoid sharing sensitive information.

There probably really ought to be an "erase history" function.

------
userbinator
This is one of the reasons why it's great that the source code of web
pages/apps is (relatively, compared to binaries) easy to reverse-engineer -
because of their environment inside a browser, web apps have such a low
barrier to "phoning home" and making requests that privacy-sensitive
information being leaked may otherwise be difficult to notice. Imagine if they
were all encrypted/obfuscated binaries...

I don't use Google Docs (and probably never will), but if I did, all those
requests - "these /save calls every time I typed something" - would be enough
for me to investigate why it's generating so much traffic. I'm using an OS
that still has a useful network activity indicator icon, so I easily know when
there's data being transmitted/received when there shouldn't be.

There's a line of thought that says those sorts of indicators are unnecessary
and a distraction, and that may be valid justification for removing them, but I
can't help feeling like their removal is making users more unaware of what
their machines are doing - and thus easier for companies to do things like
this to them.

~~~
traek
When you say that "privacy-sensitive information" is "being leaked", you make
it sound much worse than it is. The information being sent seems completely
normal for an online word processor with a revision history, and it's not
being "leaked" to anyone besides the company providing the word processor.

~~~
userbinator
_The information being sent seems completely normal for an online word
processor with a revision history_

When most people hear "revision history", they think of the versions of the
document that exist between explicit saves or periodic autosaves, and not
extremely fine-grained per-keystroke activity logging.

------
misingnoglic
I remember reading an article about how it's a shame that authors don't use
pen/paper anymore, since we can't see their crossouts and things for rough
drafts. I'd argue that this would be infinitely superior if authors would give
us access to some revision history.

~~~
TheGeminon
You do still miss out on doodles in the margins however

~~~
hunter2_
If the entire computing session, instead of just the word processor, is
similarly cloud-hosted with similarly granular revision history, you've got
your margins.

------
ChicagoBoy11
I've tried using the example URL on his blog with one of my documents, just to
see exactly how the information is stored, and I could never get it to send me
an actual response with any of my documents. Has anyone had any luck?

------
mintplant
Didn't Google Docs used to have this "playback" feature built in? I clearly
remember there being a slider at the top of the page that you could scrub back
and forth through a document's revision history.

~~~
haylem
I think you've got it confused with some implementations of EtherPad. At least
I know I've seen it there, but I don't recall seeing it in Google Docs.

~~~
mintplant
No, I've never used EtherPad before. It's probably Google Wave I'm thinking
of.

------
pjc50
Interesting. Is this a side-effect of the old Google Wave design, which had
collaborative documents where you could watch your collaborator type in
realtime?

~~~
skybrian
Operational transformations are used by both Google Wave (now Apache Wave) and
Google Docs [1]. The basic idea is to avoid latency by making sure all edit
operations commute, so patches can be applied out of order to get the same
result.

This is not so different from source control except that merge conflicts are
handled differently.

Differential synchronization [2] might be easier to implement, though.

[1]
[http://en.wikipedia.org/wiki/Operational_transformation](http://en.wikipedia.org/wiki/Operational_transformation)

[2]
[https://neil.fraser.name/writing/sync/](https://neil.fraser.name/writing/sync/)
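
As a toy illustration of the transformation idea described above, the sketch below handles only concurrent inserts on a plain string. It is an assumption-laden simplification, not the Wave or Docs implementation: the `Insert` type, the `site` tie-breaker, and the `transform` function are hypothetical names. Each side applies its own edit first, then the other side's edit after transforming it, and both end up with the same document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where the text goes
    text: str
    site: int   # hypothetical tie-breaker when two sites insert at the same index


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `other` has already been applied."""
    if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op


doc = "helo world"
a = Insert(3, "l", site=1)    # one user fixes "helo"
b = Insert(10, "!", site=2)   # another user appends "!" concurrently

left = apply(apply(doc, a), transform(b, a))    # a arrives first, then b
right = apply(apply(doc, b), transform(a, b))   # b arrives first, then a
print(left, "|", right)
```

Deletes, rich text, and more than two concurrent edits make the real problem much harder than this.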

~~~
A1kmm
Storing the operations for ever is not necessary - they could apply an upper
bound on what latency is reasonable; the server then could coalesce changes
into groups (sorted by position in the document, transformed into a position
at the end of the group of changes) for history. For undo, they could use
grouped changes after a certain number of operations of history.

They would then be able to implement sharing without allowing access to the
history.

Alternatively, and more easily, they could record the state when someone is
given access to the document, and not allow access to operations received by
the server before then unless they are granted a separate permission.
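
A minimal sketch of the coalescing idea, covering only the simplest case: consecutive inserts where each one continues right where the previous one ended. The `Insert` type and the `coalesce` and `replay` helpers are hypothetical. Deletes, time windows, and the position sorting mentioned above are left out.

```python
from dataclasses import dataclass


@dataclass
class Insert:
    pos: int
    text: str


def coalesce(ops: list[Insert]) -> list[Insert]:
    """Merge runs of inserts where each one continues right after the previous one."""
    grouped: list[Insert] = []
    for op in ops:
        last = grouped[-1] if grouped else None
        if last is not None and op.pos == last.pos + len(last.text):
            grouped[-1] = Insert(last.pos, last.text + op.text)
        else:
            grouped.append(Insert(op.pos, op.text))
    return grouped


def replay(ops: list[Insert], doc: str = "") -> str:
    """Apply inserts in order to rebuild the document."""
    for op in ops:
        doc = doc[:op.pos] + op.text + doc[op.pos:]
    return doc


keystrokes = [Insert(0, "c"), Insert(1, "a"), Insert(2, "t"), Insert(0, "a ")]
history = coalesce(keystrokes)
print(history, replay(history))
```

The grouped history rebuilds the same document, but no longer reveals the order or pace of the individual keystrokes inside each group.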

------
skywhopper
Interesting perspective by the author and most of the commenters here. My
first thought when I read the headline and the article was "privacy breach!".
This is certainly interesting data. But it can also be dangerous if the owner
of the document isn't aware of the implications of the storage format.

~~~
drivingmenuts
Well, I'm aware now.

I had no idea they stored the edit history.

------
mattnewport
When can we get an IDE that can do this, and track copy / paste across source
files :)

~~~
williamstein
SageMathCloud's IDE (which is CodeMirror-based, so similar to Adobe Brackets,
but online) does this, with a nice slider like in "pirate pad". Just create a
file and start editing and it will record diffs at about a 1-second interval
while actively editing. Click on the blue "History" button and you get a
slider across past revisions. Jon Lee implemented this functionality last
summer. [https://cloud.sagemath.com](https://cloud.sagemath.com) I frequently
use this fine history functionality when coding. I'll remember that I had my
code in some useful state 15 minutes ago (say in the middle between git
commits), and I can easily drag the slider back to that point in time and look
at the code. It adds a whole new dimension to coding that dramatically
improves things. Unlike with Google docs, the SageMathCloud revision history
for a file foo is simply stored in the file .foo.sage-history, with one diff
per line (in JSON format). You can delete .foo.sage-history or archive it or
whatever.

------
hjnilsson
Amazing fun, browsed through my master's thesis revision history and found some
interesting tidbits!

A suggestion for further evolution would be an option to color text by the
author that wrote it (for collaborated documents).

------
jacquesm
Very nice, not only because it shows you a playback of Google Docs documents
but also because the author takes the time to note his inspiration, first
drafts of the project and eventual evolution of it.

------
tlb
Interesting -- I would not have guessed they were storing all our keystrokes.
It'd be fascinating to mine that data, for example, to find patterns of typos.

~~~
jaredmcateer
Anyone that has used the multi-user collaborative editing should not be
surprised that at the very least your keystrokes are regularly being
transmitted to Google for playback on other people's computers. It's not so
much of a leap to assume that they are also stored given Google's lust for
data.

------
maaaats
Very cool. Just spent the day writing some integration for Google Drive, and
was a bit amused when writing a single sentence in Google Docs increased the
"Changes.list()" id by ~40.

But I'm curious: Can one delete these kinds of revisions displayed here? Those
visible in the GDocs UI are only a few, major revisions (which may be
troublesome in itself for people not knowing about it and sharing a document).

------
atmosx
Hahaha I loved the _Hack the Gibson_ touch.

------
vtbose
Reminds me of the submission
[https://news.ycombinator.com/item?id=557191](https://news.ycombinator.com/item?id=557191)
where we could see pg's thought process in writing the Founders Visa essay.
(the link no longer works)

~~~
vilhelm_s
The OP also mentions this, and has a link to a working copy:
[https://code.stypi.com/hacks/13sentences](https://code.stypi.com/hacks/13sentences)

------
haylem
Fine, there's the worry you typed in something sensitive in the document. But
it's far from being the major worry here, if I understood everything
correctly...

Basically, if you have all keystrokes with timing info, you've got all the
keystroke dynamics required to establish an individual's biometric keystroke
fingerprint. And that seems freaking scary to me, for a few reasons.

1) Impersonation

Anyone can grab that fingerprint from any shared file on Google Docs and then
feed it to a program so that you can impersonate the author for various
purposes... Be it typing a blog comment (harmless), or something more
insidious like logging into a secure system using that type of auth system.

Another trivial example could be impersonating someone on a Coursera course,
where they use such fingerprints for identification on paid / "signature
track" courses, which allow you to get a verified certificate from a known
university. They use a photo, but that can also be fed by a tweaked webcam
driver. So there you have it, you can hire someone to take lessons and pass
exams for you. Or fail you.

2) Anonymity

Anyone with access to a shared google doc can now get your fingerprint, and if
they implement a similar record on another website, they can identify who you
are. Maybe such fingerprints are not 100% unique, but they surely can be
accurate enough to pick you from a crowd of anonymous commenters on a website,
for instance.

You could also imagine that software you already have installed could
identify you using a similar approach.

In that case, surfing using a Tails/Tor VM and in the incognito mode of your
browser won't help you that much.

I'm sure in a perfect world this could sound awesome: no logins required
anywhere, just type in stuff and get automatically ID-ed and credited for what
you type and say. In our world, that could be bad for some people.

Plus I can only imagine how bad that would be if companies started to include
in their web and desktop apps EULA that you agree to share keystroke dynamics
with them and that you authorize them to redistribute it to partners. BAM, a
global commercial database of uniquely identified users, no matter what
account or throwaway email they use. Forget cookies and stuff, they won't need
that anymore.

Bit far-fetched of course, as that would require some effort. But it's not
that much effort that it wouldn't be interesting enough for someone to do
it...
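
For a sense of how little is needed for the keystroke-dynamics idea above, here is a rough sketch under stated assumptions. The event format (timestamp in milliseconds, character) and the `digraph_profile` and `distance` helpers are hypothetical, and real biometric systems use far richer features than a mean delay per character pair.

```python
from collections import defaultdict
from statistics import mean


def digraph_profile(events):
    """Map each pair of consecutive characters to its mean delay in milliseconds."""
    delays = defaultdict(list)
    for (t0, c0), (t1, c1) in zip(events, events[1:]):
        delays[c0 + c1].append(t1 - t0)
    return {pair: mean(ms) for pair, ms in delays.items()}


def distance(p, q):
    """Mean absolute difference over digraphs both profiles share (smaller is closer)."""
    shared = p.keys() & q.keys()
    if not shared:
        return float("inf")
    return mean(abs(p[k] - q[k]) for k in shared)


# Made-up sample events: (timestamp_ms, character)
known = digraph_profile([(0, "t"), (100, "h"), (180, "e"), (400, "t"), (520, "h")])
unknown = digraph_profile([(0, "t"), (120, "h"), (210, "e")])
print(known, unknown, distance(known, unknown))
```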

~~~
haylem
Well, you know what, I actually realized that there's no timing info in the
recorded data. So, no problem. I jumped the gun quite a bit.

Still, a bit worrisome, because it could easily be modified to track it. And
for all we know, some sites could be doing that. Facebook was (at least at
some point) listening to what you were typing in timeline posts even if you
didn't actually decide to send them, so it wouldn't be surprising if some
sites did that sort of stuff.

Interesting project idea...

~~~
divegeek
The article says there is timing data, with microsecond resolution. That's how
the author's tool is able to provide "real-time" playback.

~~~
haylem
Hmm, true, it's mentioned in first paragraph, but I couldn't find it in the
rest of the article's body when I came back to it. Well, this is rather bad
then...

------
hftf
Has the author made any insights into reverse-engineering Google Docs’ spell
checking?

------
kdma
Very cool read! But I don't seem to understand how the algorithm behind the
whole phrase/paragraph tracking is supposed to work. Can anyone enlighten me?

------
joshdance
This is awesome. I love stuff like this, using incidental data to do something
very cool. Major props.

------
itsbits
So there is no way to clear revision history??..if yes why bother about
privacy??

------
joshfraser
I wish this was built as a Chrome extension instead of asking for OAuth
access.

------
ThisIBereave
Nice! Now I just need someone to make an emacs mode that does this.

------
steele
Very interesting performance art opportunity here.

~~~
gohrt
Google has run TV ads of this sort in the past.

------
franciscop
Horizontal Scroll of the Death...

------
zackify
please add an SSL cert at least

------
iclems
Funny and interesting. Did you know that firepad.io is a rich-text OT text
editor with timestamps? I think it's just what you need. You get a real-time
_collaborative_ text editor with a fully featured toolbar, and the exact
history you need to replay.

Actually, Firepad does replay the history to display the current version on
load (though it also has some snapshotting system to restart faster, but
snapshots do not erase the history, they are kept in another location).

