Thanks! So on each batched load, is the previous data rewritten with interleaved new data? Or is the key ordering such that this is never necessary?

Each batched load has no ordering. But the data I'm loading is not the same as the data I'm reading.

The data I'm loading is stuff like tags - e.g., <itemid>\t<tagid>. In human terms, "Dress A has a ruched collar." MapReduce can handle data like this, even when it arrives unordered.

The data I'm reading is computational results based on the loaded data - e.g., an index: <tagid>\t[<itemid1>, <itemid2>, ...] (where each itemid has been tagged with tagid). E.g., "here are all the dresses with a ruched collar."

(Actually, we do considerably more than this, and we wouldn't need Hadoop just for an index. But an index is the simplest example I could give.)
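
To make the index example concrete, here is a rough sketch in plain Python rather than Hadoop. It only mimics the map and reduce steps in memory. The function names and sample IDs are made up for illustration, and it assumes each input line is exactly <itemid>\t<tagid>.

```python
from collections import defaultdict


def map_phase(lines):
    # Turn each "<itemid>\t<tagid>" line into a (tagid, itemid) pair.
    # Input order does not matter; blank lines are skipped.
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        itemid, tagid = line.split("\t")
        yield tagid, itemid


def reduce_phase(pairs):
    # Group item IDs under their tag ID, dropping duplicates.
    grouped = defaultdict(set)
    for tagid, itemid in pairs:
        grouped[tagid].add(itemid)
    # Sort the item IDs so the result is the same for any input order.
    return {tagid: sorted(items) for tagid, items in grouped.items()}


def build_index(lines):
    # Hypothetical helper: chain the two phases to get <tagid> -> [<itemid>, ...].
    return reduce_phase(map_phase(lines))


# Illustrative, unordered input (not real data).
loaded = [
    "dressA\truched_collar",
    "dressB\tsleeveless",
    "dressC\truched_collar",
    "dressA\tsleeveless",
]
index = build_index(loaded)
```

Because the reduce step groups by tagid and sorts the item IDs, shuffling the input lines gives the same index, which is why the load order doesn't matter here.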

The original data is very boring. It's only after aggregation and calculation that it becomes worth reading.

