So, assuming that by "scaling" in the web space you mean "being able to respond quickly to the many reads and writes my app generates" and not "handling large volumes of data", you can do what LinkedIn does: copy the exact same graph to multiple machines and rely on eventual consistency being good enough. You make things immediate for the user most concerned with immediate feedback (keep them sticky to the copy of the graph their write updated), and for everybody else a couple of minutes of lag is no biggie. This is even more true in retail and other applications.
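To make the "sticky" idea concrete, here is a minimal toy sketch in Python. It is not how LinkedIn or anyone else actually implements it; the class `MirroredGraph`, its methods, and the 120-second sticky window are all made up for illustration. Each replica holds a full copy of the graph, a writer is pinned to the replica that took their write, and everyone else sees the edge only after replication runs.

```python
import time
import zlib


class MirroredGraph:
    """Toy model: every replica holds a full copy of the graph (hypothetical API)."""

    def __init__(self, n_replicas, sticky_seconds=120.0):
        self.replicas = [dict() for _ in range(n_replicas)]  # node -> set of neighbors
        self.pending = [[] for _ in range(n_replicas)]       # edges not yet applied
        self.sticky = {}                                     # user -> (replica, expiry)
        self.sticky_seconds = sticky_seconds

    def _home(self, user, now):
        # A recent writer stays pinned to the replica that took the write.
        entry = self.sticky.get(user)
        if entry is not None and entry[1] > now:
            return entry[0]
        # Everyone else is spread across replicas by a stable hash of the user id.
        return zlib.crc32(user.encode("utf-8")) % len(self.replicas)

    def add_edge(self, user, src, dst, now=None):
        now = time.monotonic() if now is None else now
        idx = self._home(user, now)
        self.replicas[idx].setdefault(src, set()).add(dst)
        self.sticky[user] = (idx, now + self.sticky_seconds)
        # Queue the edge for every other replica (eventual consistency).
        for other in range(len(self.replicas)):
            if other != idx:
                self.pending[other].append((src, dst))
        return idx

    def neighbors(self, user, node, now=None):
        now = time.monotonic() if now is None else now
        idx = self._home(user, now)
        return set(self.replicas[idx].get(node, ()))

    def replicate(self):
        # Apply all queued edges; in a real system this would run asynchronously.
        for idx, queue in enumerate(self.pending):
            for src, dst in queue:
                self.replicas[idx].setdefault(src, set()).add(dst)
            queue.clear()
```

The writer reads their own write right away, while a user routed to a different replica sees it only after `replicate()` has run, which is the "couple of minutes of lag" in this model.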
Scaling in the web space usually means both; many reads/writes and large volumes of data.
Yes, mirroring may work up to a point, but it eventually falls down in write-heavy applications. Ideally you want something that you can just add machines to and have it scale near-linearly. I'm not sure if that's entirely achievable for a graph search, but that's where my question was heading.
Yes, mirroring as I described will fall down when the throughput of writes into the network exceeds the ability of a single machine to just write, because you will never have the opportunity to "catch up."
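A back-of-the-envelope way to see the "never catch up" point, as a Python sketch. The function name and the numbers are made up; the only assumption is that every mirror has to apply every write and can apply at most `apply_rate` of them per second.

```python
def backlog_over_time(arrivals_per_sec, apply_rate):
    """Return the unapplied-write backlog after each second.

    arrivals_per_sec: writes arriving in each one-second interval.
    apply_rate: writes a single mirror can apply per second (assumed constant).
    """
    backlog = 0
    history = []
    for arriving in arrivals_per_sec:
        # The backlog grows by what arrives and shrinks by what the mirror applies.
        backlog = max(0, backlog + arriving - apply_rate)
        history.append(backlog)
    return history
```

A short burst above capacity drains again once the write rate drops, but a sustained rate above `apply_rate` makes the backlog grow every second without bound.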
However, that is a lot of data to be writing into your graph, especially with how crazy fast writable media and storage are getting these days. YAGNI.
Now, on the theoretical side of things: "How would you create a linearly scalable graph database across machines?" I don't know how I would do that while maintaining the same kinds of speeds for interesting graph traversals.
You can of course do it, since all modern web search is graph-based. The real question, and one I don't have an answer to, is at what point the additional performance of multiple nodes begins to trump the induced network latencies.
Splitting things is the relatively easy part -- it's building the consistency model for multi-node systems that's tricky.
For read-write partitioning it's pretty simple -- each item is largely independent; it has its columns of data and is handled by an index, so as long as you're just reading / writing / updating items, it's no problem to hash from key to index and then use that index to locate the appropriate node.
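A rough Python sketch of that key -> index -> node lookup. This is an illustration, not our actual implementation: `partition_for`, `PartitionedStore`, the choice of SHA-1, the 64 partitions, and the round-robin assignment of partitions to nodes are all assumptions.

```python
import hashlib


def partition_for(key, n_partitions):
    """Hash a key to a partition index (SHA-1 chosen only for stability)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


class PartitionedStore:
    """Toy item store: each item lives on exactly one node (hypothetical API)."""

    def __init__(self, node_names, n_partitions=64):
        self.n_partitions = n_partitions
        # Fixed partition -> node table, assigned round-robin.
        self.owner = {p: node_names[p % len(node_names)] for p in range(n_partitions)}
        # Stand-in for the per-node storage; a real system would make a network call.
        self.data = {name: {} for name in node_names}

    def node_for(self, key):
        return self.owner[partition_for(key, self.n_partitions)]

    def put(self, key, columns):
        self.data[self.node_for(key)][key] = dict(columns)

    def get(self, key):
        return self.data[self.node_for(key)].get(key)
```

Reads, writes and updates of a single item each touch one node. This says nothing about traversals that cross nodes or about consistency, which is where the hard part is.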
The devil is of course in the details. If we see that barrier approaching we'll plan ahead for scaling out this way.
However, doing some quick calculations: the English Wikipedia gets 5.4 billion page views per month, which translates to about 2100 per second. On my MacBook I get an average query time of 2.5 ms on our profile dataset for Wikipedia's graph -- meaning 400 requests per second. Extrapolating from there, scaling up to 6x that on a hefty server doesn't seem unreasonable, and that ignores the fact that we could go further by caching the results (since most of them would be duplicate requests) to push that number even higher.
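The arithmetic, spelled out in Python so the assumptions are visible. It assumes a 30-day month, one query per page view, queries served one at a time, and traffic spread evenly over the month; the function names are just for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def views_per_second(monthly_views):
    """Average request rate, assuming traffic is spread evenly over the month."""
    return monthly_views / SECONDS_PER_MONTH


def queries_per_second(avg_query_ms):
    """Throughput of one worker answering queries back to back."""
    return 1000.0 / avg_query_ms


demand = views_per_second(5.4e9)   # a bit under 2100 per second
laptop = queries_per_second(2.5)   # 400 per second
server = 6 * laptop                # 2400 per second with the assumed 6x speedup
headroom = server - demand         # positive means the server keeps up on average
```

Peak traffic will be higher than the monthly average, so the margin is thinner than it looks before caching is taken into account.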
So, yeah, it's an issue that's in the back of our heads, but not one we're currently dreading.
No, not in our profile set, but because of our locking system (where readers don't block writes and writes don't block reads) I believe it should hold up well under moderate write conditions (in the case of recommendations applications, we assume that writes are infrequent relative to reads).
It's still an untested assumption, but the system is architected to hold up well in those situations, and I think it will scale reasonably there.