
Show HN: Radix sort big files in memory - afiodorov
https://github.com/afiodorov/radixmmap
======
antender
Was there any external requirement to sort this file strictly in memory? Why
not just split the file into chunks (around 100 MB each), sort them as usual,
and then k-way merge them afterwards? In theory this can be faster than
allocating lots of RAM for the radix sort, especially if you use an SSD
instead of an HDD.

~~~
antender
After consulting the GNU sort manual: sort has a -m option for exactly this
case of merging pre-sorted files, so you can test this by using 'split -l',
then 'xargs sort' (to parallelize), then 'sort -m' to merge the chunks.
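
A rough sketch of that pipeline (the file name, chunk size and core count
below are illustrative, not taken from the post):

    # split the input into line-aligned chunks
    split -l 5000000 input.csv chunk_
    # sort each chunk in parallel; LC_ALL=C forces plain byte-order comparison
    ls chunk_?? | LC_ALL=C xargs -P16 -I{} sort -t, -k1,1 -o {}.sorted {}
    # k-way merge the pre-sorted chunks
    LC_ALL=C sort -m -t, -k1,1 -o sorted.csv chunk_??.sorted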

~~~
afiodorov
I agree that a significant proportion of the time is spent on IO; only 8m38s
(out of 19m37s) is actually spent sorting. However, in past experiments I've
found that sorting chunks and merging them with `sort -m` is much, much slower
than a single `sort -S100%`.

------
BubRoss
I'm not sure what the point of this is; it's something basic, but implemented
poorly.

\- Saving the start and end positions of a string that represents a date in 16
bytes is silly. Just convert the date to a 64- or 32-bit integer (or even
less, depending on the granularity and range of the dates); a rough sketch
follows below.

\- Run through the file, converting the dates to the smaller integer
representation. When the array of integers gets too big, sort it and use it to
write a sorted text-file chunk.

\- Once that is done, merge the text-file chunks together.

Any good sorting algorithm should be able to do this in under 20 minutes with
16 cores. If IO is the bottleneck, it would only get worse when shuffling text
lines around inside a giant memory-mapped file.
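
A rough shell sketch of the convert-the-date-once idea (a
decorate-sort-undecorate pass using gawk's mktime; the file name and the
RFC 3339 timestamp in column 1 are assumptions, and it sorts in a single pass
rather than in chunks):

    # prepend an integer key derived from the timestamp in column 1,
    # sort numerically on that key, then strip the key again
    gawk -F, '{ d = $1; gsub(/[-:TZ]/, " ", d); print mktime(d) "\t" $0 }' input.csv \
      | LC_ALL=C sort -n -k1,1 \
      | cut -f2- > sorted.csv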

~~~
jasonwatkinspdx
> Just convert the date to 64 or 32 bit integer (or even less depending on the
> granularity and range of the dates).

We use RFC 3339 because there are a lot of gotchas that appear when you
conflate civil timekeeping with an absolute time scale. The number of seconds
in a day is not constant, and is generally only known a few months ahead of a
coming leap second. Civil timekeeping syntaxes can deal with this gracefully
precisely because they're a structured representation rather than absolute
nanoseconds relative to some epoch.

~~~
BubRoss
They only need to be converted one way so that sorting is fast. You might be
overthinking this.

------
TheTank
Thanks for sharing. Do I understand correctly that this requires loading the
whole file into memory along with an ordered list of keys? Or is it just the
first n bytes that are loaded into memory? If the former, it seems very
expensive in terms of RAM, particularly if your data file has multiple
columns.

An alternative I have used is to load the file into a database, sort by the
key I want (which only loads the key into memory), and then write the result
out to a file. It does go through disk, but you can handle larger files since
only the key needs to be in memory, not the whole file.
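
A minimal sketch of that approach with sqlite3 (the file, table and column
names are assumptions; it presumes input.csv has a header row with a date
column):

    # import the CSV into a table, then write the rows back out ordered by the key
    sqlite3 -cmd ".mode csv" -cmd ".import input.csv t" tmp.db \
      "SELECT * FROM t ORDER BY date;" > sorted.csv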

~~~
tleb_
It does a memory map (mmap). Your file is addressed in virtual memory. See
[https://en.wikipedia.org/wiki/Mmap](https://en.wikipedia.org/wiki/Mmap) and
[https://en.wikipedia.org/wiki/Memory-mapped_file](https://en.wikipedia.org/wiki/Memory-mapped_file)

------
jepcommenter
Since you sort by the first field anyway, could you please try omitting the
field split (-t, -k1)? For me it gives a noticeable improvement:

    $ stat --printf="%s\n" p.csv
    1258291200

    $ time sort -t, -k1 -S100% -o sorted.csv p.csv
    real    0m50,186s
    user    4m6,962s
    sys     0m4,562s

    $ time sort -o sorted.csv p.csv
    real    0m43,483s
    user    3m36,473s
    sys     0m4,282s

~~~
loeg
Where did you find the dataset, or did you construct your own?

~~~
afiodorov
It's a dataset of balance changes of Bitcoin addresses, downsampled to daily
resolution.

You could extract it from BigQuery's public Bitcoin data.

------
known
What would be the result if we add these sort options?

        sort -f -s --batch-size=1024 -T/home

~~~
afiodorov
Could you elaborate on why this could be faster? These experiments take some
time to complete. Also, should I not use the --parallel flag?

~~~
known
Sorry, I meant to tune sort:

        LC_ALL=C sort --parallel=16 -t, -k1 -S100% -f -s --batch-size=1024 -T/home /tmp/test

~~~
afiodorov
I had to reduce the batch size to 512, because otherwise sort complained that
the batch size was too large. The timings are:

      real    30m54.557s
      user    82m9.055s
      sys     3m1.723s

No improvement over sort without the batch-size option...

------
helltone
Out of curiosity, what is the entropy of your input? I.e., what's the output of

    gzip -c input.csv | wc -c

versus

    cat input.csv | wc -l

?

