

Using Erlang, C And Lisp To Fight The Tsunami Of Mobile Data - bugsense
http://highscalability.com/blog/2012/11/26/bigdata-using-erlang-c-and-lisp-to-fight-the-tsunami-of-mobi.html

======
bsaul
Why use a self-made Lisp here? Why not an existing one? Is it because
regular Lisps use a VM? But then what's the tradeoff?

On a side note, I really don't know if the savings will be worth it once the
company starts growing and hiring. They'll spend a lot of time maintaining
very custom code, running on a custom implementation of a language,
alongside an Erlang codebase... unless they open source everything to get a
community to grow the user base for their tool.

~~~
stinkypete
Probably has something to do with this statement, repeated many times on the
page:

    
    
      We are able to do this, because starting our LISP-VM and doing all these processes on every
      request, is still many times faster than having a VM (with a garbage collector) online all the time.
    

Self-made LISPs usually aren't that complicated to write, and if you're
taking shortcuts like preallocating chunks of memory rather than writing a
full-blown VM GC system, things get even easier. My guess is that most of
the LISP code doing the data analysis consists of calls into primitives
written in C that do all the heavy lifting, so the performance hit of an
interpreted or bytecode-interpreted language is minimal.
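
To illustrate what I mean (purely my own sketch, nothing from bugsense's
actual code), the embedding usually boils down to a dispatch table that maps
script-level names to C functions that do the real work:

    #include <stdio.h>
    #include <string.h>
    
    /* Toy version of "primitives written in C": the script layer only
     * resolves a name and dispatches; the loop runs at C speed. */
    typedef double (*prim_fn)(const double *args, int n);
    
    static double prim_sum(const double *a, int n) {
        double t = 0;
        for (int i = 0; i < n; i++)
            t += a[i];
        return t;
    }
    
    static struct { const char *name; prim_fn fn; } prims[] = {
        { "sum", prim_sum },
    };
    
    static double apply(const char *name, const double *args, int n) {
        for (size_t i = 0; i < sizeof prims / sizeof *prims; i++)
            if (strcmp(prims[i].name, name) == 0)
                return prims[i].fn(args, n);
        return 0; /* unknown primitive */
    }
    
    int main(void) {
        double xs[] = { 1, 2, 3 };
        printf("%g\n", apply("sum", xs, 3)); /* prints 6 */
        return 0;
    }

An interpreter like that pays the dispatch cost once per call, which is
noise next to the work the primitive itself does.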

Having to maintain a small LISP interpreter in C is definitely an extra
thing to keep up with, but you have to balance that against the work that
would normally go into avoiding long GC pauses once you start tracking a
nontrivial amount of objects/data in a VM with garbage collection. Java, for
instance, doesn't support separate heaps, which means that a piggish
background processing thread that is haphazardly allocating objects can
cause "core" threads doing I/O to pause for several seconds during full GCs.
Even with Java's fairly sophisticated heap management schemes it is still
_very_ difficult to design a system that completely avoids full GC pauses.
Erlang is somewhat better in this respect but introduces its own set of
problems to the mix.
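
And the preallocation shortcut can be as simple as a per-request arena:
allocation is a pointer bump, and the whole chunk is released at once when
the request ends, so there is never a collection pause. Again a generic
sketch of the technique, not their actual allocator:

    #include <stdio.h>
    #include <stdlib.h>
    
    /* Per-request bump allocator: no per-object free, no GC. */
    typedef struct { char *base, *next, *end; } arena;
    
    static arena arena_new(size_t cap) {
        char *p = malloc(cap);
        return (arena){ p, p, p + cap };
    }
    
    static void *arena_alloc(arena *a, size_t n) {
        /* a real allocator would also round n up for alignment */
        if (a->next + n > a->end)
            return NULL; /* arena exhausted */
        void *p = a->next;
        a->next += n;
        return p;
    }
    
    static void arena_free(arena *a) { free(a->base); }
    
    int main(void) {
        arena a = arena_new(1 << 20); /* 1 MiB up front */
        int *counts = arena_alloc(&a, 100 * sizeof *counts);
        counts[0] = 42;
        printf("%d\n", counts[0]);
        arena_free(&a); /* everything released in one shot */
        return 0;
    }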

In any case, I agree with your sentiment (probably would not be the angle I
would have taken), but I can kind of see why the authors might have decided to
go this route.

edit: formatting fixes

~~~
bsaul
Since garbage collection seems to be such a big issue, does anyone know of
an effort to build an Objective-C-like language with automatic reference
counting and memory retain/release (as with clang) on the server side?

PS: I'm talking about Objective-C here because it's the only language I know
that does it that way, not because of its features as a language.

~~~
krichman
I'm pretty sure CPython reference counts objects. Also Objective-C -- as in
GNUstep, if you aren't using a Mac server.

EDIT: There's a downside to reference counting in that either you or the
runtime makes the retain/release calls _very_ often, whereas a
mark-and-sweep collector only touches each object once per collection. So
reference counting might not be faster overall, but it typically avoids long
pauses. What you probably really want is something like Azul's JVM, which
has a kernel extension so it can collect memory concurrently, resulting in
shorter pauses than sweeping and faster overall time than reference
counting.
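
To make the tradeoff concrete, here's roughly what manual reference counting
looks like in C (a toy, single-threaded sketch; real runtimes use atomic
increments). Every ownership change costs a call, but deallocation is
deterministic, with no pause:

    #include <stdio.h>
    #include <stdlib.h>
    
    typedef struct {
        int refcount;
        int payload;
    } rc_obj;
    
    static rc_obj *rc_new(int payload) {
        rc_obj *o = malloc(sizeof *o);
        o->refcount = 1; /* the creator is the first owner */
        o->payload = payload;
        return o;
    }
    
    static void rc_retain(rc_obj *o) { o->refcount++; }
    
    static void rc_release(rc_obj *o) {
        if (--o->refcount == 0)
            free(o); /* freed right here, deterministically */
    }
    
    int main(void) {
        rc_obj *o = rc_new(7);
        rc_retain(o);  /* a second owner appears */
        rc_release(o); /* first owner done; object survives */
        printf("%d\n", o->payload);
        rc_release(o); /* last owner done; object freed now */
        return 0;
    }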

------
vemv
As it turns out, they previously were using Clojure + Python:
http://googleappengine.blogspot.com/2012/02/bugsense-hybrid-app-experiences-using.html

~~~
jonromero
We are still using it for some other parts.

------
nnq
It would be nice to know how many man-hours it took them to get this system
from 0/idea to production, and how many people were on the team...

~~~
jonromero
Two people, prototype in 2 months (CTO of BugSense here)

~~~
mulander
Is there a specific process you follow? Lean/agile, whatever buzzword comes
closest to describing how you go from idea to delivery.

How closely do you work with your clients, and how much does your system
change in response to feedback?

What drawbacks (if any) have the current tech stack choices had so far?

~~~
jonromero
Having a small (but very strong) R&D team helps you eliminate most of these
"processes". The buzzword here is "iterations". Lots of them.

Our clients have been involved since day 0 (which I think is the key to our
success, btw). The first prototype (built in two days) was in Erlang using
Mnesia.

The drawbacks are:

  + not having lots of engineers who know C and Erlang
  + outsiders having problems with LISP (that's why we implemented PIG)

~~~
imperialWicket
This is a great write-up. Off-topic, but I'm wondering what bottlenecks you
encountered with the straight Erlang + Mnesia prototype?

~~~
jonromero
Mnesia is by far the fastest (and easiest to use) thing we've ever tried.
The only problems were that keeping the data in memory was more expensive,
and that raw pointer arithmetic (in C) is so damn fast by comparison. And
(the most important thing) there's the 2GB limit Dets has :/

------
chubot
I wonder why they didn't just use Lua or some other embeddable language rather
than writing their own Lisp? Is Lua's GC too invasive?

------
zhoudh
I'm interested in the missing part: what data structures/layouts are used
inside LDB? For an in-memory DB, CPU cache misses play a key role in query
execution. And how much data is touched to execute a single query in the
common case?

~~~
dgk42
Cache misses are nothing compared to the cost of initializing the Lisp
engine, so we don't worry much about them. Also, since the data organization
is up to us, we touch only what is necessary (well, almost! We are talking
about fine-grained locking here, but we modify (mutate) only the datum that
needs to be changed).
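
In simplified form the locking looks something like this (illustrative only;
the real records and layout are more involved): one lock per record, so
updating one datum never blocks access to the others:

    #include <pthread.h>
    #include <stdio.h>
    
    /* One mutex per record: contention is confined to the single
     * datum being mutated. (Simplified illustration.) */
    typedef struct {
        pthread_mutex_t lock;
        long crash_count;
    } record;
    
    #define NRECORDS 1024
    static record table[NRECORDS]; /* in-memory, preallocated */
    
    static void record_bump(unsigned idx) {
        record *r = &table[idx % NRECORDS];
        pthread_mutex_lock(&r->lock);
        r->crash_count++; /* mutate only this datum */
        pthread_mutex_unlock(&r->lock);
    }
    
    int main(void) {
        for (int i = 0; i < NRECORDS; i++)
            pthread_mutex_init(&table[i].lock, NULL);
        record_bump(42);
        printf("%ld\n", table[42].crash_count); /* prints 1 */
        return 0;
    }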

------
taltman1
I don't understand why the data is unstructured, and thus SQL is not an
option. Isn't the data coming in from calls to the company's own API?

~~~
dgk42
It has to do with semantics and the way you view your data. SQL is not a
Turing-complete language (without abusing it!) and represents a subset of
relational algebra. For us, the relations might seem to resemble those of an
RDBMS at first sight, but we work with them in a different way.

Also, the data format for input is not tightly coupled with the LDB system.
One can have different processing scripts (lisp) that organize the data in a
different way. Or you can write a C module that extends the underlying db
engine to work with your implementation or even your database infrastructure
(whether it includes SQL, NoSQL, etc.).
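
Schematically, such a module is just a vtable the engine calls through, so
the wire format stays decoupled from storage (again, a simplified
illustration rather than the real interface):

    #include <stdio.h>
    
    /* A pluggable ingest module: the engine sees only the vtable. */
    typedef struct {
        const char *name;
        int (*ingest)(const char *raw); /* parse + store one record */
    } db_module;
    
    static int ingest_csv(const char *raw) {
        int fields = 1; /* toy parser: just count fields */
        for (; *raw; raw++)
            if (*raw == ',')
                fields++;
        return fields;
    }
    
    static db_module csv_module = { "csv", ingest_csv };
    
    int main(void) {
        int n = csv_module.ingest("app,1.2,NullPointerException");
        printf("%s module stored a record with %d fields\n",
               csv_module.name, n);
        return 0;
    }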

~~~
andor
I'm not sure if I understand your answer. To quote from the article again:

 _During our tests, we saw that SQL databases weren't a good fit due to the
fact that our data were unstructured and we needed a lot of complex “joins”
(and many indexes)._

I guess that you say it's unstructured because the most important parts of bug
reports are stack traces and debug messages? My naive assumption is that stack
traces can be stored in a relational way, and it seems that you do:

 _For us, the relations might seem to resemble those of an RDBMS at first
sight, but we work with them in a different way._

From 'unstructured', 'complex "joins"' and 'work in a different way' I infer
that you do a lot of string processing on the fly, and that the real problem
was precomputing everything vs. staying flexible?

------
martinced
Self-made Lisp, in-memory database and hacking the Linux kernel: what's not
to like about it ; )

I'm wondering: how does one decide to change TCP_TIMEWAIT_LEN from 60
seconds to 20 seconds? 20 seems an arbitrarily picked value: I understand
that by doing so they can drastically cut down on the number of open
connections after the last FIN is received, but why 20 instead of, say, 17
or, say, 9?

~~~
matthavener
On Linux:

    echo 20 > /proc/sys/net/ipv4/tcp_fin_timeout

~~~
saurik
The question was not "how do you do it once you decide" but "how do you decide
on such an arbitrary replacement".

Also: "Do not be confused by the /proc/sys/net/ipv4/tcp_fin_timeout config
item. The FIN TIMEOUT is not the same as the TIMEWAIT length." --
<http://www.stolk.org/debian/timewait.html>
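
The TIMEWAIT length is a compile-time constant in the kernel source
(include/net/tcp.h), which is why there is no sysctl for it; shrinking it to
20 seconds means patching the define and rebuilding the kernel:

    /* include/net/tcp.h (stock Linux kernel): */
    #define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy
                                      * TIME-WAIT state, ~60 seconds */
    
    /* the change described in the article amounts to:
     * #define TCP_TIMEWAIT_LEN (20*HZ) */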

~~~
priestjim
The value was picked heuristically: we measured (through Ganglia and a set
of CLI tools) the average TCP dialog duration between a set of mobile
devices and our database, and then ran a set of tests with varying TIMEWAIT
lengths. The 20 second period is a sweet (and round :)) spot, with < 0.01%
of packets per day arriving after the FIN. It's a good approach all in all,
since our expected payload packet set is extremely small (just 1 on
broadband connections with large MTUs).

\- Panagiotis Papadomitsos, Head of Infrastructure @ BugSense

~~~
krichman
I was hoping it had been chosen empirically. It's awesome that they were
wise enough not to pick some theoretical best and call it a day.

------
dschiptsov
Should we call this example code a Spaghetti Scheme?)

