This stuff can be very counterintuitive. Locality may not be what you want.
For example, last I heard Google's search index was sharded by document rather than by term.
That sounds odd: if it were sharded by term, a given search would only need to touch a handful of servers (one per term), with the intermediate results combined afterwards. With it sharded by document, every query has to go to all the nodes in each replica/cluster.
It turns out that's not as bad as it seems. Since everything is in RAM, they can answer "no matches" on a given node extremely quickly (using Bloom filters that mostly sit in on-processor cache, or the like). They also send back only truly matching results, rather than intermediates that might be discarded later, saving cluster bandwidth. Lastly, it means their search processing is independent of the cluster communication, giving them a lot of flexibility to tweak the code without structural changes to their network, etc.
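To make the idea concrete, here is a minimal sketch of that design: each document shard keeps a Bloom filter over its terms so it can answer "no matches" without touching its postings, and a query fans out to every shard but gathers only true matches. This is an illustration of the technique, not Google's actual implementation; all names (`DocShard`, `scatter_gather`, the filter sizing) are made up for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash probes into a fixed-size bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _probes(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._probes(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits & (1 << p) for p in self._probes(item))


class DocShard:
    """One document shard: full postings plus a Bloom filter over its terms."""

    def __init__(self, docs):
        self.postings = {}
        self.bloom = BloomFilter()
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.postings.setdefault(term, set()).add(doc_id)
                self.bloom.add(term)

    def search(self, terms):
        # Fast negative: if any term definitely isn't on this shard,
        # answer "no matches" without touching the postings at all.
        if not all(self.bloom.might_contain(t) for t in terms):
            return set()
        result = None
        for t in terms:
            docs = self.postings.get(t, set())
            result = docs if result is None else result & docs
        return result or set()


def scatter_gather(shards, query):
    """Every query fans out to all shards; each returns only true matches."""
    terms = query.lower().split()
    hits = set()
    for shard in shards:
        hits |= shard.search(terms)
    return hits
```

A query like `scatter_gather(shards, "bloom search")` asks every shard, but shards holding none of the terms reject it after a few bit probes, and only final matches cross the network.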
Does that mean everyone doing search should shard the same way? Probably not. You have to design this stuff carefully and mind the details. Using any given data store is not a silver bullet.
Sharding a search system by document gives you several advantages. You can scale horizontally by adding new shards for new documents. You can tune a group of documents more easily, e.g. rank Wikipedia higher, or give it better hardware or higher priority where necessary. Performance is also more uniform. It's easier to index content at different frequencies, and easier to re-index content after tweaking your algorithms.
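A sketch of what "tweak a group of documents more easily" can look like under document sharding: route one corpus to its own shard group so it can be boosted, re-indexed, or provisioned independently. The shard names, the Wikipedia special case, and the boost value are all hypothetical, just to illustrate the shape of the routing.

```python
import zlib

NUM_GENERAL_SHARDS = 4


def shard_for(doc_id, source):
    """Route a document to a shard. A dedicated shard group for one
    corpus (here, hypothetically, Wikipedia) lets you tune it in
    isolation: different hardware, priority, or re-index schedule."""
    if source == "wikipedia":
        return "wiki-shard"
    # crc32 is used instead of built-in hash() so routing is stable
    # across processes and restarts.
    return f"general-shard-{zlib.crc32(doc_id.encode()) % NUM_GENERAL_SHARDS}"


# Per-shard score multipliers applied at result-merge time
# (illustrative values only).
SHARD_BOOST = {"wiki-shard": 1.5}


def merged_score(shard, base_score):
    """Apply the shard's boost when merging results from all shards."""
    return base_score * SHARD_BOOST.get(shard, 1.0)
```

Because each document lives on exactly one shard, changing the boost, hardware, or indexing frequency of the `wiki-shard` group touches nothing else in the cluster.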
If you shard a search index by term instead, each document ends up duplicated across every shard that owns one of its terms. And for a large index, you need to scale up your hardware to handle hot terms, or shard by document within each term anyway.
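The duplication is easy to see in a sketch: under term sharding, a posting for a document lands on whichever shard owns each of its terms, so a document with terms spread across shards is stored (at least by reference) on all of them. Names and sharding function here are illustrative.

```python
import zlib
from collections import defaultdict


def build_term_shards(docs, num_shards):
    """Term-sharded index: each (term -> doc) posting lands on the shard
    that owns the term, so one document fans out across many shards."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            owner = zlib.crc32(term.encode()) % num_shards
            shards[owner][term].add(doc_id)
    return shards


def shard_copies(shards, doc_id):
    """How many shards hold a posting for this document. Under document
    sharding this would always be exactly 1."""
    return sum(
        1 for shard in shards
        if any(doc_id in posting for posting in shard.values())
    )
```

A document with four distinct terms can appear on up to four shards; compare that with document sharding, where it would live on exactly one, and the storage and update-fan-out cost of term sharding becomes clear.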
> That sounds odd, since if it was sharded by term, then a given search would only need to go to a handful of servers (one for each term) and then the intermediate result combined. But with it sharded by document, every query has to go to all the nodes in each replica/cluster.
Exactly. Then a flash crowd occurs and your shard fails to service requests.