

Compact Off-Heap Structures and Tuples In Java - pacoverdi
http://mechanical-sympathy.blogspot.co.uk/2012/10/compact-off-heap-structurestuples-in.html

======
pacoverdi
Funny, I was actually manipulating C structures from Java, and after hitting a
nasty issue, went procrastinating on twitter where I stumbled upon this blog
post :)

For those interested, there are less low-level ways to access fields in
structures, for example Javolution
[http://javolution.org/target/site/apidocs/javolution/io/Stru...](http://javolution.org/target/site/apidocs/javolution/io/Struct.html)

I'm working on a very similar lib, but one where Structs are stateless,
allowing them to be used concurrently from multiple threads.

~~~
jmount
Just a quick question. With the flyweight idea in the article you get
additional thread-unsafety even for immutable data, because you are changing
where the flyweight points (which is how you want to do it, to avoid creating
and destroying a lot of flyweight objects). What is your idea for
pooling/controlling multiple flyweight accessors?

~~~
pacoverdi
[Not sure whether the question is directed to the author or to my comment but
I'll answer anyway :) ]

Basically, there are 3 notions:

1) the buffer (byte[] or whatever) containing the data, immutable at least
while the data is being accessed by a given thread

2) the Struct object: defines the structure of a message (field length,
offset, arrays etc.)

3) the Message (aka the flyweight) that points to a given offset in the
buffer and is bound to a Struct instance

The buffer and the message are normally accessed by only one thread, but
several threads can share Struct instances (i.e. the dictionary).

Each field (even deep in the field hierarchy) knows its offset relative to the
beginning of the message, so values can be accessed like, e.g.:

    int foo = struct.foo.getInt(msg);

The only complicated thing is dealing with arrays (especially when nested in
other arrays). The message does the bookkeeping necessary to safely access
elements, i.e. the mutable state is stored in the flyweight.
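A minimal sketch of the stateless-struct idea described above, using a plain
ByteBuffer (all names here are illustrative, not from any actual library): the
Struct and its fields carry only offsets and can be shared across threads,
while each thread keeps its own Message flyweight holding the mutable state.

```java
import java.nio.ByteBuffer;

public class StatelessStructExample {

    // A field knows only its offset relative to the start of a message.
    static final class IntField {
        final int offset;
        IntField(int offset) { this.offset = offset; }
        int getInt(Message msg)            { return msg.buffer.getInt(msg.base + offset); }
        void putInt(Message msg, int value) { msg.buffer.putInt(msg.base + offset, value); }
    }

    // The shared dictionary of fields: safe to use from multiple threads
    // because it carries no per-message state.
    static final class FooStruct {
        final IntField foo = new IntField(0);
        final IntField bar = new IntField(4);
        static final int SIZE = 8;
    }

    // The flyweight: all mutable state (which buffer, which offset) lives
    // here, one instance per thread.
    static final class Message {
        ByteBuffer buffer;
        int base;
        void wrap(ByteBuffer buffer, int base) {
            this.buffer = buffer;
            this.base = base;
        }
    }

    public static void main(String[] args) {
        FooStruct struct = new FooStruct();              // shared across threads
        ByteBuffer data = ByteBuffer.allocate(FooStruct.SIZE * 2);

        Message msg = new Message();                     // per-thread flyweight
        msg.wrap(data, 0);
        struct.foo.putInt(msg, 42);

        msg.wrap(data, FooStruct.SIZE);                  // repoint, no allocation
        struct.foo.putInt(msg, 7);

        msg.wrap(data, 0);
        System.out.println(struct.foo.getInt(msg));      // prints 42
    }
}
```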

------
lmm
Pretty much as expected. I'd be interested to see a comparison with object
pooling (obviously not appropriate to the algorithm as written, but in a real
system you're more likely to be streaming in data in chunks rather than
putting it all in a big buffer and then reading through it), which lets you
get a lot of the performance advantages of avoiding GC without completely
abandoning Java's safety guarantees.
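For reference, the object-pooling alternative being described could be sketched
along these lines (names are illustrative): reuse mutable record objects
instead of allocating one per row, so steady-state processing produces no
garbage for the GC to collect.

```java
import java.util.ArrayDeque;

public class PoolExample {

    static final class Record {
        long id;
        double value;
    }

    // A trivial pool: acquire() reuses a released Record when one is
    // available, otherwise allocates. Not thread-safe; one pool per thread.
    static final class RecordPool {
        private final ArrayDeque<Record> free = new ArrayDeque<>();

        Record acquire() {
            Record r = free.poll();
            return (r != null) ? r : new Record();   // grow lazily
        }

        void release(Record r) {
            free.push(r);   // caller must not keep a reference after this
        }
    }

    public static void main(String[] args) {
        RecordPool pool = new RecordPool();
        // Stream rows through the pool instead of materialising the whole
        // data set: after warm-up, no new Records are allocated.
        for (long row = 0; row < 1_000; row++) {
            Record r = pool.acquire();
            r.id = row;
            r.value = row * 0.5;
            // ... pass r to the reducer here ...
            pool.release(r);
        }
    }
}
```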

~~~
mjpt777
How would it work with object pooling if I wanted to query a large table of
data? This is often needed in real big data applications.

~~~
lmm
For that kind of problem I'd probably be using Hadoop, which does object
pooling internally with the objects it passes into your mappers/reducers.

For a non-hadoop datasource you could do the same thing by hand: stream in the
data from the table, turning it into objects from your pool and passing them
through to your reducer function in small batches.

~~~
mjpt777
Interesting. It sounds like your workloads are IO-dominated, since you don't
mind the JVM startup cost from Hadoop for each query on each node. I'm more
often looking at large data sets that are entirely memory-resident, which
tends to drive the design this way. In finance, queries need latencies well
below a second, which Hadoop cannot come close to satisfying. This is the
difference between batch and real-time analytics.

~~~
lmm
You're right that most of my big-data experience is batch work, and outside of
finance. I guess I'm finding it hard to envision the kind of data where you'd
want to work on the whole set, but that set's small enough to fit into memory
- for real-time analytics wouldn't you be wanting to stream data and reduce it
to the representation you want as it comes in?

------
chii
That is both scary, and so very intriguingly interesting!

------
jeffffff
You can do this on-heap with a byte[] using Unsafe.arrayBaseOffset and
getInt(Object, long), putInt(Object, long, int) and friends.
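A sketch of that on-heap variant: treat a plain byte[] as a struct by
reading/writing ints at raw offsets with sun.misc.Unsafe. Note that Unsafe is
unsupported API, skips bounds checks, and uses native byte order.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OnHeapStructExample {

    static final Unsafe UNSAFE = loadUnsafe();
    // Offset of element 0 of a byte[] from the start of the object header.
    static final long BASE = UNSAFE.arrayBaseOffset(byte[].class);

    static Unsafe loadUnsafe() {
        try {
            // Unsafe has no public constructor; grab the singleton via reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[16];              // room for four "int fields"
        UNSAFE.putInt(buffer, BASE + 4, 42);       // write the field at offset 4
        int v = UNSAFE.getInt(buffer, BASE + 4);   // read it back
        System.out.println(v);                     // prints 42
    }
}
```

The buffer stays an ordinary on-heap byte[] that the GC can still move; Unsafe
just bypasses the bounds-checked get/put of ByteBuffer.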

