

Optimising Garbage Collection Overhead in Sigma - psibi
https://simonmar.github.io/posts/2015-07-28-optimising-garbage-collection-overhead-in-sigma.html

======
ezyang
I mentored Giovanni on the two-step allocator patch, and in the process we
discovered that on Windows, page tables for all of your reserved address space
are counted towards the memory limits of your process. This also affects Go
(https://golang.org/issue/5402 and https://golang.org/issue/5236), which uses a
similar trick of reserving virtual address space for its heap. If anyone has
ideas for how to deal with this on Windows I think many people would be quite
interested :)
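
For the curious: the two-step trick is to reserve a very large span of virtual address space up front with no access permissions (so nothing is actually committed), and then commit chunks to the heap on demand. On Windows the analogous calls are VirtualAlloc with MEM_RESERVE and then MEM_COMMIT, which is where the page-table accounting bites. Here's a minimal Haskell FFI sketch of the reserve-then-commit pattern on Linux; the flag values are Linux x86-64 numbers, error handling is elided, and GHC's actual implementation is C code in the RTS, so treat this purely as an illustration:

    {-# LANGUAGE ForeignFunctionInterface #-}
    
    import Data.Bits ((.|.))
    import Foreign.Ptr (Ptr, nullPtr)
    import Foreign.C.Types (CInt (..), CSize (..))
    import System.Posix.Types (COff (..))
    
    foreign import ccall unsafe "mmap"
      c_mmap :: Ptr () -> CSize -> CInt -> CInt -> CInt -> COff -> IO (Ptr ())
    
    foreign import ccall unsafe "mprotect"
      c_mprotect :: Ptr () -> CSize -> CInt -> IO CInt
    
    -- Linux x86-64 values for the flags we need.
    protNone, protRead, protWrite :: CInt
    protNone  = 0x0
    protRead  = 0x1
    protWrite = 0x2
    
    mapPrivate, mapAnonymous, mapNoReserve :: CInt
    mapPrivate   = 0x02
    mapAnonymous = 0x20
    mapNoReserve = 0x4000
    
    main :: IO ()
    main = do
      -- Step 1: reserve 1 GiB of address space with PROT_NONE.
      -- No physical memory or swap is committed at this point.
      let reserved = 1024 * 1024 * 1024 :: CSize
      base <- c_mmap nullPtr reserved protNone
                     (mapPrivate .|. mapAnonymous .|. mapNoReserve) (-1) 0
      -- Step 2: commit the first 1 MiB by flipping it to read/write,
      -- only once the heap actually grows into it.
      _ <- c_mprotect base (1024 * 1024) (protRead .|. protWrite)
      putStrLn "reserved 1 GiB, committed 1 MiB"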

------
amelius
> Sigma, which is part of the anti-spam infrastructure

Sounds like something that could be done as a batch job with little memory. Or
not?

~~~
thoughtpolice
Sigma is quite advanced; it's essentially an online DSL where authors can push
anti-spam rules (written in Haskell) into live production by reloading code at
runtime. Queries are done against Sigma in real time by other services, and in
turn, Sigma has to query a lot of other data sources continuously in order to
determine whether a rule may fire.

For example, a particular rule about the nature of some of your friends on
Facebook may need to query 10 different data sources (different DBs, caches,
monitoring infrastructure). One of the really nice things about Sigma is that
it's built on Haxl, a library for efficient concurrent data access, which can
also optimize away the typical 'N+1 query problem'.

What this means is you can write a program like:

    
    
      ids <- getAllUserIds              -- fetch from source: 1 query
      forM_ ids $ \uid -> do
        friends <- getUserFriends uid   -- N queries, 1 for each id
        ...
    

This is simple and naive, yet Haxl can automatically optimize it into a
program that will A) batch the data accesses together (so instead of issuing
one query per ID, the lookups are coalesced into a single request covering the
whole set of users), B) access each data source concurrently with no
programmer intervention, so queries that can execute in parallel do so, and C)
cache the results, so that already-fetched data isn't re-queried.
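
The core mechanism behind A) and B), heavily simplified from the ICFP paper
linked below (the real Haxl types carry caching and scheduling machinery on
top), is an Applicative in which a computation is either finished or blocked
on a batch of pending requests, and combining two computations merges their
batches. A stripped-down sketch of that idea, not the actual Haxl API:

    -- A computation over some request type 'req' is either done, or
    -- blocked on a batch of requests plus a continuation to resume
    -- once those requests have been answered.
    data Fetch req a
      = Done a
      | Blocked [req] (Fetch req a)
    
    instance Functor (Fetch req) where
      fmap f (Done a)       = Done (f a)
      fmap f (Blocked rs k) = Blocked rs (fmap f k)
    
    instance Applicative (Fetch req) where
      pure = Done
      -- The crucial case: when both sides are blocked, concatenate
      -- their batches, so independent fetches run in a single round.
      Blocked rs k <*> Blocked rs' k' = Blocked (rs ++ rs') (k <*> k')
      Done f       <*> x              = fmap f x
      f            <*> Done x         = fmap ($ x) f

A scheduler then runs the computation round by round: gather the blocked
batch, fire it off as one grouped, concurrent request, write the answers back
(a real implementation, as in the paper, threads IORefs through the requests
for this), and resume the continuation, caching results along the way so
nothing is fetched twice.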

There's a very good paper by Simon, the author of this blog post, discussing
the design of Haxl and its use:
http://community.haskell.org/~simonmar/papers/haxl-icfp14.pdf
Quite the neat system!

Note: I do not work at Facebook, but I do chat with Simon a bit; this is
basically the very high-level, 20,000-foot view based on what I've read of
Simon's writing on the subject.

