
Announcing Firepad — Our Open Source Collaborative Text Editor - avolcano
https://www.firebase.com/blog/2013-04-09-firepad-open-source-realtime-collaborative-editor.html
======
yuchi
I'm happy to see [Substance](<http://substance.io>) work (Tim's OT) spread
to other related projects!

~~~
mikelehen
Yes! His work is great. Made for a great starting point for building Firepad.
:-)

------
mbrock
Does this depend on proprietary server tech hosted by Firebase?

~~~
toddmorey
Yes, it does.

"Firepad has no server dependencies and instead relies on Firebase [hosted
service] for real-time data synchronization."

So the front-end editor portion of it is open source, but the data sync
between connected clients happens via Firebase.

You can find more info at the bottom of this page: <http://www.firepad.io/>

~~~
mbrock
Thanks. Seems cool, I just wasn't sure, and phrasing like "Firepad has no
server dependencies" had me picturing some advanced WebRTC-based peer syncing
or something, which is pretty far-fetched but you never know these days...

~~~
derefr
Technically, the whole point of OT is to allow for "some advanced [...] peer
syncing or something"; any OT peer can actually synchronize state against any
other peer. The original OT applications were all peer-to-peer apps. It's very
similar to git, actually: everyone is on a particular commit[1], where each
commit has another commit as its parent; clients can create regular commits,
and thus move apart in state-space; and then they can accept others' commits
to move back together (creating a merge commit in the process.) Very
decentralized.

Ever since Etherpad/Google Wave/Google Docs, though, OT has been consistently
mangled into a hub-and-spoke design, where there's a server who keeps a
canonical state, and the OT tie-breaking algorithm is always decided in the
server's favor.

Used in this fashion, it's not much different from (and actually a bit higher-
overhead than) just having the server arbitrarily accept and temporally order
input operations however it likes, and then send back a single overwriting
transformation to all clients to move them from their last _server-known_
commits to the new server-canonical commit. (In other words, what any
multiplayer game's logic server does.)

[1] One major difference is that OT "commits" are referred to by their vector
clock, instead of by their SHA. Cryptographic hashes are nice, and would give
OT even cooler properties if it could use them, but they're a bit too slow for
something that gets regenerated with every character typed. CRCs might be fast
enough, but in an actually-distributed peer-to-peer system you'll get real
collisions fast.
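The convergence trick described above can be sketched in a few lines. This is
a toy transform for concurrent inserts only, with made-up names; real OT
libraries also handle deletes and use a site ID to break ties at equal
positions:

```javascript
// Toy OT "transform" for two concurrent insert ops against the same
// base document. Each op is { pos, text }. transform(a, b) rewrites
// a so it can be applied after b and still converge.
function transform(a, b) {
  // If b inserted at or before a's position, shift a right by the
  // length of b's text. A real system also breaks ties at equal
  // positions with a site ID; this sketch assumes distinct positions.
  if (b.pos <= a.pos) {
    return { pos: a.pos + b.text.length, text: a.text };
  }
  return a;
}

function apply(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Two peers edit "abc" concurrently:
var base = 'abc';
var opA = { pos: 1, text: 'X' }; // peer A inserts after 'a'
var opB = { pos: 2, text: 'Y' }; // peer B inserts after 'b'

// Each peer applies its own op, then the transformed remote op,
// and both converge on the same document: 'aXbYc'.
var resultA = apply(apply(base, opA), transform(opB, opA));
var resultB = apply(apply(base, opB), transform(opA, opB));
```

Note there is no server anywhere in that exchange; the hub-and-spoke variants
just pin the tie-breaking to one privileged peer.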

~~~
therockhead
Do you know of any JS OT framework that supports peer to peer?

~~~
mikelehen
I don't know the state of the project, but while researching OT stuff for
Firepad, I came across <https://github.com/sveith/jinfinote>. It looks like a
peer-to-peer OT implementation (I think it requires a server to set up the
initial session, but after that, everything could be peer-to-peer [as
implemented, the server just relays messages with no processing]).

------
Johnyma22
Check out the Etherpad plugins. Stuff like showing cursor position is
available; <http://beta.etherpad.org> has it enabled, for example. Etherpad is
extensible and does support code editing with syntax highlighting, you just
have to jump into /admin/plugins and click "install" on whatever plugin you
want :)

~~~
glavata
Loved etherpad before, and still do :)

------
gnuvince
And it's web based, and thus I will never use it.

~~~
macspoofing
That's a strange restriction to put on yourself. Today you have to have a
good reason for your product NOT to be web-based.

~~~
lutusp
> Today you have to have a good reason for your product to NOT be web-based.

For a document that the visitor cannot afford to lose, or for anything in
which privacy, security or personal control are issues, a Web-based document
is _obviously a mistake_. Why do you think the big players in cloud-based and
Web-based storage and applications are having such a hard time getting people
to adopt them?

The answer is obvious -- _the drawbacks greatly outweigh the advantages_. The
risk of losing sensitive content or having it be compromised or stolen is too
great.

Even games, a less serious endeavor, suffer when they adopt a Web-based
approach. Look at the brouhaha that followed from making the most recent
SimCity version work online-only:

[http://www.gamesindustry.biz/articles/2013-03-15-ea-defends-...](http://www.gamesindustry.biz/articles/2013-03-15-ea-defends-always-online-simcity)

A quote:

------------------------------------

"So, could we have built a subset offline mode? Yes," Bradshaw admitted. "But
we rejected that idea because it didn't fit with our vision...The SimCity we
delivered captures the magic of its heritage but catches up with ever-
improving technology."

A number of upset fans in the comments section to Bradshaw's update were not
assuaged by the explanation. As one user going by the handle klymen wrote,
"With all due respect, to write an article about why the new SimCity has to be
always online and not to mention DRM or the anti-piracy measure even once, is
dishonest and downright disrespectful. It seems like that you don't think very
highly of your audience. Yes, the DRM issue is a sensitive topic, but to avoid
it, not to mention it as one of the reason why SimCity is always online, shows
absolute disconnect and contempt for your fans. We're not stupid."

------------------------------------

The above is just about a game, not a business transaction or potentially
sensitive communication.

As to data loss, consider incidents like this:

Title: Amazon's Cloud Crash Disaster Permanently Destroyed Many Customers'
Data

Link:
[http://articles.businessinsider.com/2011-04-28/tech/29958976...](http://articles.businessinsider.com/2011-04-28/tech/29958976_1_amazon-customer-customers-data-data-loss)

Quote: "In addition to taking down the sites of dozens of high-profile
companies for hours (and, in some cases, days), Amazon's huge EC2 cloud
services crash permanently destroyed some data.

The data loss was apparently small relative to the total data stored, but
anyone who runs a web site can immediately understand how terrifying a
prospect any data loss is."

In conclusion, and to second the SimCity gamer's quote above, we're not
stupid.

~~~
macspoofing
First things first, web-based doesn't imply cloud-based. You can have a web-
based service that's hosted on your servers and only services your intranet.
This way you get most of the advantages of the web (no software to install and
maintain, and easy access to the resources by your users), and still maintain
control over your stack. There is very little reason to build desktop software
anymore.

>For a document that the visitor cannot afford to lose, or for anything in
which privacy, security or personal control are issues, a Web-based document
is obviously a mistake

I don't see the obviousness of this. In fact, if it's a document that you
cannot afford to lose, a cloud service makes much more sense than something
stored locally (even with backup procedures). There are cloud services that
handle sensitive data (such as patient records and images) today,
successfully. Yes, there may be cases in which it makes sense to have data
reside on your servers, as opposed to on some cloud provider's, but those are
edge cases now. We recently went through something like this at work: instead
of hosting and maintaining our own Sharepoint servers, we went with a cloud-
based CRM. It makes too much sense. Our source code, which is by far the most
valuable piece of our business, is hosted on Kiln. We do have backup
strategies in case Kiln's servers get hit by an asteroid, but we have no
qualms about Fog Creek maintaining our codebase.

>Why do you think the big players in cloud-based and Web-based storage and
applications are having such a hard time getting people to adopt them?

That is absolutely false. In fact, the opposite is true. The trend has been to
offload almost everything to the cloud. The big cloud-storage guys all had
phenomenal growth.

>The data loss was apparently small relative to the total data stored, but
anyone who runs a web site can immediately understand how terrifying a
prospect any data loss is."

And how are you immune to this when YOU are responsible for managing your
backup strategy and maintaining your servers? Do you know how many horror
stories there are of data loss that occurred because of things like a bad
RAID setup? Backup, replication, and server maintenance are hard, expensive,
and time-consuming, and most of the time they have no relevance to the
underlying business. If you're in the business of making plastic widgets, you
want to focus on making plastic widgets and leave server maintenance to those
whose entire business is server maintenance.

~~~
lutusp
> First things first, web-based doesn't imply cloud-based.

True, but there are few applications that reside in a browser that don't use
cloud-based storage for the results. One may safely refer to Web-based and
cloud-based technologies in a single breath.

>> Why do you think the big players in cloud-based and Web-based storage and
applications are having such a hard time getting people to adopt them?

> That is absolutely false.

No, it's true, and you need to do impartial research before making this sort
of claim. The big players are having a hard time getting people to adopt
cloud-based and Web-based technologies, and I already gave the reasons.

[http://www.infoworld.com/d/cloud-computing/its-cloud-resista...](http://www.infoworld.com/d/cloud-computing/its-cloud-resistance-starting-annoy-businesses-383)

A quote: "Accenture and the LSE surveyed more than 1,035 business and IT
executives and conducted more than 35 interviews with cloud providers, system
integrators, and cloud service users. The key finding: There's a gap between
business and IT. Businesspeople see the excitement and business benefits of
cloud computing, so they're pushing for it. However, IT people see cloud
computing as causing issues with security and lock-in, _so they're pushing
back_."

> And how are you immune to this when YOU are responsible for managing your
> backup strategy and maintain your servers.

This is a non-argument for an obvious reason -- if infrastructure data loss
is an issue, Web-based data loss is a bigger issue, because in the latter
case, users won't necessarily know where the data are located, and the number
of possible failure modes is higher.

> Do you know how many horror stories there are of data loss that occurred
> because of things like bad RAID setup.

I can't believe you even posted this argument. How does an unreliable cloud
RAID array constitute an improvement over an unreliable infrastructure RAID
array?

I haven't even mentioned the legal issues, where law enforcement has a much
easier time subpoenaing evidence from the cloud, compared to legally acquiring
it from your local network.

[http://www.forbes.com/2010/04/12/cloud-computing-enterprise-...](http://www.forbes.com/2010/04/12/cloud-computing-enterprise-technology-cio-network-legal.html)

A quote: "Enterprises are moving their assets to the cloud to capture its many
business benefits, including ease of deployment and reducing, if not
eliminating, the need for IT infrastructure. However, cloud computing offers
an array of pitfalls for the unwary. _The unique legal risks and
considerations presented by the cloud are especially important and often
overlooked by nonlawyers_."

The article goes on to list five very serious and often overlooked legal
pitfalls of cloud computing.

~~~
macspoofing
>One may safely refer to Web-based and cloud-based technologies in a single
breath.

It may be one of those things that needs to be qualified. Pretty much every
Fortune 1000 enterprise runs some kind of a web-based intranet, which may or
may not be accessible outside the VPN, with various services, from email, to
document management, to ... anything.

>if infrastructure data loss is an issue, Web-based data loss is a bigger
issue

HOW?! First, there is nothing preventing you from having your own backups.
Second, even if you completely trust the cloud provider (and who says you
should?), I claim that it is still safer than managing your own data for most
businesses, especially if your business cannot afford a top-notch IT support
staff (or any staff). If you're GE, you can invest in server farms; if you're
Plastic Widget Inc., you're better off with a reputable cloud vendor.

> Businesspeople see the excitement and business benefits of cloud computing,
> so they're pushing for it. However, IT people see cloud computing as causing
> issues with security and lock-in, so they're pushing back.

God bless SysAdmins, but they do have a tendency to be anti-anything that
comes in on their turf. They are almost never the decision makers. Having said
that, you do realize that cloud services went from nothing (a few years ago)
to a huge multi-hundred-billion-dollar industry in the span of a few years,
and growing. Clearly, SOMEBODY sees value.

> The unique legal risks and considerations presented by the cloud are
> especially important and often overlooked by nonlawyers.

Yes, there are "unique legal risks and considerations". What's your point?
There are risks to cloud services, but there are incredible benefits as well.
One always weighs risk and reward accordingly. The rewards are why the
industry is growing. Here's an example of a 'unique legal consideration':
Canadian hospitals cannot use cloud providers hosted on Amazon or anywhere in
the US to host patient data because of things like the Patriot Act, so what do
they do? They go with a regional cloud provider that makes a guarantee that
their data will not leave the province. I've seen that happen.

~~~
lutusp
>> if infrastructure data loss is an issue, Web-based data loss is a bigger
issue

> HOW?!

Because there are more factors involved. A local storage device has some
number of failure modes and a probability of failure: A. The cloud has
additional failure modes and vulnerabilities: B. The outcome is A + B. The
failure modes are additive.
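Strictly, for independent failure modes the combined probability is
1 - (1-A)(1-B) = A + B - A*B, which is close to A + B when both are small. A
quick sanity check (the figures below are made up for illustration):

```javascript
// Probability that at least one of two independent failure modes
// occurs: 1 - P(neither) = 1 - (1-a)(1-b) = a + b - a*b.
// For small a and b this is approximately a + b.
function combinedFailure(a, b) {
  return 1 - (1 - a) * (1 - b);
}

var A = 0.01; // hypothetical local failure probability
var B = 0.02; // hypothetical additional cloud failure modes

combinedFailure(A, B); // ~0.0298, slightly below A + B = 0.03
```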

~~~
macspoofing
>A local storage device has some number of failure modes and a probability of
failure: A. The cloud has additional failure modes and vulnerabilities: B. The
outcome is A + B.

That's funny =)

------
joeblau
You guys are awesome! I love your hot sauce!

------
prg318
This is a really neat idea! It would be nice to see the editor widget be
resizable (like an HTML text area) so that you don't have to deal with scroll
bars for larger portions of text.

Pretty impressive though!

~~~
mikelehen
This should be a pretty easy thing to add, and it's all open source... :-)

------
karl_gluck
Very cool technology! I really appreciate your decision to allow us to just
give it a try without having to set up an account or log in.

------
civilian
I think there's a bug with people's text cursors writing over each other.

<http://imgur.com/1IvMccd>

~~~
mikelehen
Can you provide more detail? Here, at firepad@firebase.com, or as a GitHub
issue? From the screenshot it's not immediately obvious to me what went wrong.

People are certainly allowed to write over each other (e.g. click in the
middle of your sentence and write something). If that's not what happened, let
me know.

------
saurik
So, as with most things built with Firebase, I have to ask how the security
works. Last I talked to the Firebase team, they were building expression-only
rules for managing server-side validation. This allows you to express some
reasonable subset of permissions, but not all possible ones.

In this case, the OT history for the document (required to synchronize
clients) is stored in Firebase (with each op being a separate object with a
massive ID, I imagine this will become awkward with large numbers of old
documents, but I digress). Additionally, snapshots are occasionally stored.

Rather better than previous offerings I've seen using Firebase, this
demonstration has been put together to solve the first few obvious problems: I
am not allowed to dump the set of documents[1], nor am I allowed to
arbitrarily corrupt the database by deleting random objects[2]. So far, so
good.

[1]: <https://news.ycombinator.com/item?id=4780495>

[2]: <https://news.ycombinator.com/item?id=3824775>

A set of example security rules for Firepad is actually provided as part of
the GitHub project, so we can do some analysis of the kinds of checks we will
need to bypass in order to break this particular demo ;P. (Of course, this
just makes us faster, it isn't what makes this possible.)

[https://github.com/firebase/firepad/blob/master/examples/sec...](https://github.com/firebase/firepad/blob/master/examples/security/secret-url.json)

[https://github.com/firebase/firepad/blob/master/examples/sec...](https://github.com/firebase/firepad/blob/master/examples/security/validate-auth.json)

Reading these, it turns out that the only verification that is being done on
the snapshots is that 1) they look reasonably valid (have the correct set of
fields) and 2) they have the correct author field associated with them (as in,
the same one that is used on the history revision item).
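For illustration only (this is not the actual file, and the field names are
taken from the behavior described here), a Firebase rule of roughly this shape
can enforce field presence and authorship, but has no way to express OT
consistency:

```json
{
  "rules": {
    "checkpoint": {
      ".validate": "newData.hasChildren(['a', 'id', 'o'])",
      "a": { ".validate": "newData.val() == auth.uid" }
    }
  }
}
```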

 _However, it doesn't do any consistency checks on the data itself._ It
doesn't even verify that the snapshot we are uploading is different than the
one currently on the server, so the problem of corrupting the state is really
easy: we just need to pull the current snapshot and modify its data.

The only pain we could run into is that it could also verify that the author
of the snapshot is the current user; but that doesn't help: all we need to do
is to make a quick edit to the document and then use our new revision (which
we legitimately own) to build our new corrupted snapshot.

That said, while that check is present in the example rules on GitHub, that
check isn't actually used in the deployed copy of Firepad on this server as
this server is entirely anonymous and thereby none of the users have any auth
information at all... we can just pretend to be other users.

For users who wish to follow along at home, you just need to have node.js
installed, and then do "npm install firebase". You can then use the following
script to destroy any document you want: you just need to set the "room"
variable to the ID # of the document you want to modify.

    
    
        #!/usr/bin/nodejs
        var Firebase = require('firebase');
    
        // parseInt(window.location.hash.substring(1))
        var room = 44;
    
        // documents are sharded across 15 Firebase instances
        var shard = room % 15;
        var db = new Firebase('https://firebase-firepad' +
            shard + '.firebaseio.com/' + room);
    
        // overwrite the checkpoint's operation list with garbage,
        // keeping the author and revision id fields intact
        var check = db.child('checkpoint');
        check.once('value', function (value) {
            value = value.val();
            check.set({
                a: value.a,
                id: value.id,
                o: [''] // random data would be better
                        // but I'm both lazy and busy today ;P
            });
        });
    

When the client then restores this snapshot and attempts to play back the
resulting history to "catch up" it will instead end up outputting tons of
errors to the console, as the operations stored in the history will be
referring to document positions that no longer exist or are different.

    
    
        Firepad: Invalid operation. https://firebase-firepad14.firebaseio.com/44 C2GK
    

The client then has two options: it can either skip the history entry or it
can decide the entire document is corrupt. In this case, it seems that
Firebase believed the better of the two options was to simply skip these
operations: new clients then manage to resync their state.

But, existing clients now have desynchronized state, so all operations that
are being synchronized live between the various clients on either side of this
split (ones that started from this snapshot, and ones that pre-existed it) are
going to result in this error; that didn't really help.

To be very explicit for a second: this is a different scenario than just
"well, it's an anonymous system, so anyone can delete the data": we didn't
just go in and delete the data in the document, we actually corrupted the
state of the document, rendering further attempts to edit it useless.

It is currently my belief that Firebase's security rules system is simply not
powerful enough to secure an OT-based text editor, whether or not it uses
snapshots (at least assuming it supports offline; there might be tricks you
can play if all users are required to be online at all times).

(edit: I am finding it interesting that my previous security analyses of
Firebase projects, combined with example code, had been voted up quite high,
and this one has now been downvoted to 0. I wonder if people just don't care
as much about security anymore? Is it because it is open source? Is the
Firebase team themselves going around voting down? ;P)

~~~
mikelehen
Hey Saurik,

Thanks for the thorough and correct analysis as usual. :-)

The key things I would point out are that:

1) The checkpointing is an optimization. You could either remove it (which
will hurt initial load time) or delegate it to trusted server code (which will
be very lightweight; you could run hundreds of rooms off of a tiny EC2
instance or whatever).

2) In general, the whole point of collaborative editing is that you trust your
collaborators. If they're malicious, they can already cause mayhem on your
editing experience with constant edits, obscene content, etc.
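The trusted checkpointing in point 1 could be a small pure fold over the
history, with the Firebase wiring around it. A rough sketch (the op format
here is simplified for illustration and is not Firepad's actual one):

```javascript
// Replay history ops onto the last snapshot. A trusted server
// process would run this and write the result back, so clients
// can't upload corrupt checkpoints themselves.
function applyOp(doc, op) {
  if (op.type === 'insert') {
    return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
  }
  if (op.type === 'delete') {
    return doc.slice(0, op.pos) + doc.slice(op.pos + op.length);
  }
  throw new Error('unknown op type: ' + op.type);
}

function checkpoint(snapshot, historyOps) {
  return historyOps.reduce(function (doc, op) {
    // Validate each op against the current document length so
    // corrupt history entries are rejected, not silently applied.
    if (op.pos < 0 || op.pos > doc.length) {
      throw new Error('invalid op position: ' + op.pos);
    }
    return applyOp(doc, op);
  }, snapshot);
}

// The surrounding process would watch history in Firebase, e.g.
// db.child('history').on('child_added', ...), and periodically
// write checkpoint(...) back to db.child('checkpoint').
```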

~~~
saurik
1) I do not need to modify the snapshots: I can simply inject corrupt history
state. The problem is that the server in these kinds of systems is normally
supposed to be running the OT algorithm in order to verify that the data being
uploaded and stored as part of the permanent document record is valid.

(edit: That said, you would be hard-pressed to do this kind of OT-based text
editor without the snapshots, especially with the very large number of
separate objects being used to store the history state. While looking into how
you were storing the data for this in Firebase, I had tried resetting the
snapshot for a document to A0=[''], and attempting to open the document then
bogged down so far that I wasn't certain if it would even recover; this
problem will just get worse as the document ages... that only had a few hours
of history behind it: a real document would just be screwed.)

2) There is a difference between trusting your collaborators with your data,
and trusting your collaborators with your program state. Yes: if I am
collaborating with people using Google Docs, the other people can ctrl-a+del
all of the "data". However, they shouldn't be able to _break the editor
itself_ :(.

(edit:) As an example of this, if you remove the snapshots from the mechanism,
then you can make the argument that "well, if I validate and ignore all
history state that is invalid, this isn't a problem: I just need to keep the
clients in sync and skipping things that are broken is valid" (so I'm happily
willing to cede that my having added "whether or not it uses snapshots" was
going too far). I personally think that this is still a problem, as the
document record is still corrupt.

However, with the snapshots certainly, it isn't that I'm able to delete the
data: it is that I'm able to break the synchronization system itself. I can
set up situations where one party thinks they are editing the document, but
their edits are being discarded. I can make it so that one person sees a
document different than other people. In addition to doing all of this, I can
make it nearly impossible to figure out who's doing it and to fix the
situation. This is simply not the same problem as "well, you can always just
ctrl-a+del the data from the document".

~~~
mikelehen
1) The client ignores invalid history items (unless there's a bug). So while
you can pollute the Firebase data if you desire, it shouldn't affect the
behavior of the app in any way. (i.e. Other than the checkpointing thing you
brought up, you can't corrupt the history.)

That said, Firebase is certainly pushing the envelope in terms of what you
would normally do with client-only code. =] And with that comes some
challenges. In some ways Firebase is more like a peer-to-peer system than a
traditional client-server system (since the Firebase server isn't doing
complete data validation / processing). This sometimes affects the way you
write code (doing extra validation / sanitization on the client-side for
instance), but I think the advantages that come with Firebase outweigh that by
far.

~~~
saurik
Well, the alternative is something like what many of your competitors (such as
Parse) ended up deploying for handling these kinds of security situations,
which allows you to write "real code" that runs in the cloud as part of the
verification process: if you could run the OT verification algorithm on
Firebase's servers, it allows you to avoid the problem of being unable to
store trusted snapshots, but continues to offer the advantages of having
someone else manage the complexity of operating the server and handling
synchronizing the data. In such a case, the server could automatically
generate the snapshots as part of a hook that would occur when the data is
stored to the history buffer.

In this particular case, yes: you can drop the snapshots entirely, and have
the clients download and replay the entire history state in order to
synchronize as they open the document. That really isn't practical, though,
and with your current implementation it is actually painfully slow to the
point of being intolerable (although of course, you would then spend more time
optimizing that path). I continue to not be convinced that you can implement a
collaborative text editor that can be deployed in the myriad circumstances
that Firepad both seems targeted at and that other HN users are commenting on
with interest, and have it not have this problem of "users can break the
synchronization".

~~~
mikelehen
[I think HN is throttling us; I had to wait a while before a reply button
appeared. Feel free to email me (michael at firebase) if you want to continue
the conversation.]

If the standard mitigation strategies (adding authentication, banning
malicious users, etc.) aren't enough, and you're worried about people breaking
the synchronization, I agree you'd need to move the checkpointing logic to
node.js server code. Sounds like a good example app for me to write when I've
caught up on sleep and have some free time. :-)

We're also looking to do a security v2 in the future to expand on our existing
security rule capabilities and we've discussed going the "real code" route or
else allowing tighter integration with your own server-side node.js/firebase
code.

~~~
saurik
(You can just click "link" and get immediate access to a "reply" button.) As
soon as I'm setting up my own servers and having to make certain they are
secure, available, and scaling with the number of documents I have, I'm losing
a lot of the advantages of using Firebase ;P. In comparison, with a model like
Parse's, I can just push the code to them and have everything be handled
without me having to get my hands dirty. (Also, I'm currently 12 days
behind on e-mail, but you guys can get ahold of me using other routes if you
want or need to; at least Andrew should know how to get me quickly. I'm more
just responding to the things you say here at this point, though: I have
nothing new to add.) Great to hear that you may add "run real code on the
server"!

------
macspoofing
Is it based on etherpad?

~~~
yuchi
No, it's based on the [Substance](<http://substance.io>) OT library, but uses
CodeMirror instead of the
[Surface](<http://interior.substance.io/modules/surface.html>) editor.

(Substance Team member here)

~~~
unwind
By the way, unfortunately HN doesn't do Markdown. Links should just be links,
no separate display text.

~~~
yuchi
Sorry, I'm just used to finding Markdown pretty much everywhere, to the point
that I forgot HN doesn't support it!

Well, then, _pardon my markdown_ :)

~~~
ianstormtaylor
I actually like reading the Markdown; it's intuitive, and I get display text
that makes the sentence easy to parse.

------
yuvipanda
Sweet! This will be fun to integrate into Wikipedia for editing...

/me plots

------
nahimn
The homepage turned into a giant chat room

