

When memory is not enough - dzeban
http://avd.reduct.ru/programming/external-sort.html

======
virmundi
I think that memory-limited apps are more common, at least when they're
bootstrapped. Look at Digital Ocean, Linode or DreamHost. You get a VPS for
$20 with 2 GB of RAM. My phone has 2 GB. My laptops have 16 GB. Life is
different on these types of machines.

This makes picking tech rather difficult. On the one hand, C++ or D would work
well here, since they both offer good memory management (D's hybrid approach is
nice). But they are incredibly heavy (and D confuses me; the standard library
wars still come up: Tango vs. Phobos). So I don't know how productive I'd be,
given that I'm the only person in my startup. On the other hand, the JVM can
tune itself, even running Clojure, so maybe memory wouldn't be an issue for a
higher-level language.

Picking a database is hard too. I want to use NoSQL (I wrote a book on it, I
should use it). I like ArangoDB. Sadly, like MongoDB, it is mmap-based. There
are none of the efficient, swanky memory-management tricks you get in an
older-model SQL engine. But it makes sharding/replicating a breeze. So rather
than having to tweak the crap out of MySQL or Postgres and eventually get into
replacing the engine, I take a hit up front. I just have to vertically scale
ArangoDB for a while, until it makes more sense to replicate/shard.

~~~
tracker1
I made my own comment, but even in GC-based environments, if you can control
when GC happens, you can set it up to run more frequently than the runtime's
built-in controls would. My node processing scripts tend to stay under 50 MB
(usually closer to 20), and services are usually pretty light as well.

You don't necessarily need to resort to a lower-level language to avoid using
too much memory.

------
tracker1
For node.js processes, like importer/exporter scripts and other timed scripts,
I actually use the command-line flag that exposes garbage collection
(--expose-gc) and call it manually after each item... it keeps the memory
allocation in check (20 MB instead of climbing to several hundred before
processing finishes), and it does a pretty nice job.

For services, I'd be more inclined not to call it as often, but given that all
the timed event scripts currently run on one server, and potentially
simultaneously, I prefer it this way.

Even with GC-based environments, one should check on actual memory usage. I
worked on a game engine backend in .NET that would actually hang for several
seconds when GC ran... forcing GC more often kept things running a lot
smoother. A fraction of a second was an acceptable lag for my use case, but
several seconds wasn't.

------
lamacase
Bit of a tangent, but I don't know if a segfault should be considered a
consequence of a "naive" implementation. It seems more like a consequence of a
broken one.

A simple if (!A) goto err_free; would turn that segfault into a "Failed to
parse file" message.

Also, it looks like qsort is implemented on his system as qsort_r:

      http://osxr.org/glibc/source/stdlib/msort.c

qsort_r appears to allocate O(n) extra memory when the data fits into 1/4 of
physical memory. The comments suggest that the in-place quicksort fallback is
slower.

~~~
dzeban
I guess you're right: a naive implementation should fail with an error, not a
segfault. Maybe I'll just add a check on the realloc result.
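
Something like this, as a minimal sketch (the grow() helper and the doubling
policy are made up, not the actual fix). The subtle part is that writing A =
realloc(A, ...) and then checking A would still leak the old block on failure,
since realloc keeps the old allocation alive but the only pointer to it was
just overwritten; keeping the result in a temporary avoids that:

      #include <stdlib.h>

      /* Illustrative helper, not the article's code: doubles the buffer
         while keeping *buf valid even when realloc fails. */
      static int grow(long **buf, size_t *cap)
      {
          size_t new_cap = *cap ? *cap * 2 : 64;
          long *tmp = realloc(*buf, new_cap * sizeof **buf);

          if (!tmp)
              return -1;   /* *buf still points at the old block */

          *buf = tmp;
          *cap = new_cap;
          return 0;
      }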

As for the qsort implementation: I was confused by the glibc sources ;-) What
I found was the slow but in-place implementation, and what you've shown me is
the actual implementation, which falls back to the in-place variant when the
allocation fails:

      tmp = malloc (size);
      __set_errno (save);
      if (tmp == NULL)
        {
          /* Couldn't get space, so use the slower algorithm
             that doesn't need a temporary array.  */
          _quicksort (b, n, s, cmp, arg);
          return;
        }

Thanks for the feedback!

