

Dimsum is like head or tail, but pulls randomly from the entire file or stream - snoble
http://blog.noblemail.ca/2012/11/how-to-get-random-lines-out-of-file-or.html

======
neilk
This doesn't come with an elaborate test suite, but it does pretty much
everything dimsum does in a few dozen lines of Perl.

<https://github.com/neilk/misc/blob/master/randline>

I've had this script (or versions of it) around for more than a decade. I
didn't know the technique had a name.

~~~
andrewcooke
it was an example in the original camel book.

[edit: i was going to delete this, but since you replied i'll leave it - it
does appear (too?) in the camel book, on p 246 of my copy, but like you say,
it's for a single line. hadn't opened that book in years, took me some time to
find it...]

~~~
neilk
I believe it was an example in the Perl Cookbook, but for picking a single
line only. (Ancient UseNet thread: <http://bit.ly/Thd4eE>)

------
andrewcooke
uses reservoir sampling - <http://en.wikipedia.org/wiki/Reservoir_sampling>

(so it presumably consumes the entire stream before giving any results; any
alternative i can think of would not be "really random" unless you knew the
length of the stream in advance).
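For reference, reservoir sampling (Algorithm R) fits in a few lines. This is a generic Python sketch of the technique, not dimsum's actual implementation:

```python
import random

def reservoir_sample(stream, k):
    """Return k items drawn uniformly at random from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # keep each later item with probability k/(i+1),
            # evicting a uniformly chosen resident
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Note that the function can only return once the stream is exhausted, since any item, including the very last one, may still displace an earlier pick - which is exactly why the tool gives no output until EOF.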

~~~
avibryant
Yep, though a feature request I've put in is to respond to a ctl-c by
producing the results from the stream so far... that way if it's taking a
while on a large file you can interrupt and still get something useful.

~~~
cgs1019
This breaks ctl-c in my opinion. When I ctl-c I want shit to stop, not dump
(potentially large quantities of) output into my terminal.

~~~
wodow
It could accept another signal (e.g. SIGUSR1) and have this clearly
documented.
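A handler that dumps the sample-so-far without terminating is straightforward in most languages. A minimal Python sketch of the idea - the function name and the `get_sample` callback are hypothetical, not dimsum's API:

```python
import signal
import sys

def install_dump_handler(get_sample):
    """On SIGUSR1, print the current sample and keep running (POSIX only).

    get_sample: a zero-argument callable returning the reservoir's contents.
    """
    def handler(signum, frame):
        for item in get_sample():
            sys.stdout.write(str(item) + "\n")
        sys.stdout.flush()
    signal.signal(signal.SIGUSR1, handler)
```

Unlike hooking ctl-c, this leaves SIGINT's kill-it-now meaning intact.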

~~~
neilk
I just added the SIGUSR1 feature to my hacky perl script (see my other
comment).

e.g.

    
    
        $ (while true; do cat /usr/share/dict/words; done;) | ./randline 3 &
        [2] 93937
        
        $ kill -s SIGUSR1 93937
        declinograph
        brotheler
        woolpack
    
        $ kill -s SIGUSR1 93937
        lustrify
        brotheler
        bromophenol

------
nullc
uhhh. You mean like `shuf -n NNNN`?

~~~
bo1024
I wonder if the implementation of shuf would handle very large input
efficiently? Reservoir sampling wouldn't need to keep the whole input in
memory, which could be an advantage. But I don't know how shuf works.
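That is the main selling point: a reservoir sampler holds only k items no matter how long the input is. The sketch below pairs a generic Algorithm R loop with `tracemalloc` to show that peak heap stays tiny even for a million-line stream - it illustrates the principle, not shuf's or dimsum's internals:

```python
import random
import tracemalloc

def reservoir_sample(stream, k):
    """Generic Algorithm R: uniform k-sample of a stream using O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# stream a million lines without ever holding them all;
# the transient strings are freed as the generator advances
tracemalloc.start()
sample = reservoir_sample((str(n) for n in range(1_000_000)), 5)
peak_bytes = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()
```

A materialized list of those million strings would need tens of megabytes; the sampler's peak stays in the kilobyte range.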

~~~
teraflop
Doesn't look like it. I just tried running "yes | shuf -n 1" (using the latest
version of GNU coreutils, 8.20) and its memory consumption increased steadily
until I killed it.

It seems like this would be a really useful improvement, and I'm surprised
that it doesn't already seem to have been requested on the coreutils issue
tracker.

~~~
malcook
did you try "yes | dimsum -n 1"?

In my hands, `top` shows resident memory increasing steadily too....

It is perhaps more instructive to compare output from, for example,

    seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes \
        --trace-children=yes --tool=massif \
        --massif-out-file=massif.dimsum.100000.out.%p dimsum -n 1

with

    seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes \
        --trace-children=yes --tool=massif \
        --massif-out-file=massif.shuf.100000.out.%p shuf -n 1

In my hands, shuf is faster and uses less memory for this task.

How about you?

~~~
snoble
sigh, memory leak. It's fixed on GitHub. When Camilo is around I'll get him to
update the gem

~~~
malcook
thanks - looking forward to the patch

~~~
snoble
try a `gem update`. Memory performance should be much better now but I'm still
curious about speed

------
patrick_grant
I _really_ don't like how this behaves for populating the array initially, and
how it behaves for small inputs...

    
    
      $ seq 15 | dimsum -n 10
      14
      12
      3
      4
      5
      6
      7
      8
      9
      10

~~~
taltman1
Looks like a valid sample to me. Are you bothered by the ordering of the
sample members? Then I'd continue the pipeline to include a call to shuf.
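In Python terms the fix is one line on the finished reservoir; `sample` here stands in for dimsum's mostly-ordered output above:

```python
import random

sample = [14, 12, 3, 4, 5, 6, 7, 8, 9, 10]  # the mostly-ordered output above
random.shuffle(sample)  # in-place Fisher-Yates; order no longer leaks arrival position
```

The membership of the sample is unchanged - which items you get is already uniform - only the presentation order is randomized.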

