Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> Which may be useful for tape drives, but I'm not sure what else.

Partitioning large data sets, or accessing any data store where sequential access is more efficient than random access. Tape drives are merely a canonical and previously-common example of such data stores, but such situations do still exist in the modern world, even if they are not as common. I for one was rather happy to see this, as it addresses a problem I have that I did not know that there was already a solution for!

Suppose you have a huge dataset that you cannot load into memory all at once, but you can request samples of via a paging API. You want to sample a random subset of the records in that dataset. Just generating the indices won't do--you need to turn those indices into actual records. Random access imposes too much overhead--even if you can request arbitrary pages, you don't want to have to request each page more than once, and you don't want to have hold previously-requested pages in memory Just In Case you decide to back up and select another record from them later. Ergo, a non-ordered permutation function doesn't give optimal performance.

With ordered sequential sampling, you can request one page at a time, get all the records you will ever need out of that page, then move on to the next page that would contain an index from your sample.

Now, if your sample size is small enough that you can fit the list of card-indices into memory, rather than just streaming the actual selected records off somewhere else for processing (which in my case, it is), then you could achieve equal data-access performance by selecting the head of a random permutation and then sorting it--but as the article mentions, that kills your asymptotic performance for computing the indices, and means you can no longer perform selection on-line, or parallelize the operations of index selection and data retrieval.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: