3 billion items in Java Map with 16 GB RAM (kotek.net)
105 points by qwerta on Nov 11, 2012 | 37 comments

The MapDB mentioned in this article looks pretty interesting: https://github.com/jankotek/MapDB. Anyone have experience with it?

I've been using the previous version of the library (JDBM3) for some time now, although I haven't tried MapDB just yet. They have very similar APIs, however.

I've found JDBM3 a pleasure to use; it's fast and stable, with an excellent API. It's really quite easy to start using (it exposes Java collection interfaces), but you can configure it to do some powerful things under the hood. I'm not using the library in a really high-performance setting (just need a persistent key-value store), so I can't quite comment on the extreme scaling qualities.

I'm excited to switch to MapDB as it matures.

I've used JDBM3 to build a high-speed geocoder from US Census Tiger data. Because the data set is static and read-only, I use a minimal perfect hash function to map zip|street pairs to ids and then JDBM to map those ids to compressed strings of data. I have < 25M entries in JDBM, so not an envelope-pushing use case.

I'm on an old XP machine with very little memory and a slow disk, and I get quite good performance out of it, despite not being able to effectively use mapped or off-heap memory. I expect that moving to MapDB on JDK7 (with its concurrent collections and better memory allocation) will improve performance quite a bit.

From Github:

"What you should know: MapDB relies on mapped memory heavily. NIO implementation in JDK6 seems to be failing randomly under heavy load. MapDB works best with JDK7"

The original JDBM2 apparently worked fine under Android; I wonder how the new implementation (JDBM4, renamed MapDB) fares.

I do worry about Java as implemented on Android diverging from the official stuff in general. For example, Android does not have NIO.2, and Harmony NIO bugs would differ from OpenJDK NIO bugs.

Can someone please explain how this is possible, when all the Java collections have "int size()" and the max value of a Java int is 2,147,483,647?

Collections are allowed to grow larger than size() can handle. The documentation for size() says:

> If this collection contains more than Integer.MAX_VALUE elements, returns Integer.MAX_VALUE.

I see, thanks for pointing that out. I always considered it a fundamental limitation of collections: because all or almost all of them are array-backed, and array indices in Java are ints, the collections are generally limited to 2^31 items.

Does the term "array-backed" include the possibility of using an array of arrays? That would get you up to about 2^62...
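A minimal sketch of that array-of-arrays idea, in plain Java (the class and its names are illustrative, not from MapDB or any library): a long index is split into an outer and an inner int index, so the structure can address far more than Integer.MAX_VALUE elements.

```java
// Hypothetical two-level "big array" of longs addressable by a long index.
public class BigLongArray {
    private static final int CHUNK_BITS = 20;             // 2^20 longs per chunk (8 MB each)
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final long[][] chunks;
    private final long length;

    public BigLongArray(long length) {
        this.length = length;
        // Round up to the number of chunks needed to cover `length` slots.
        int chunkCount = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new long[chunkCount][];
        for (int i = 0; i < chunkCount; i++) {
            chunks[i] = new long[CHUNK_SIZE];
        }
    }

    public long get(long index) {
        // Outer index = high bits, inner index = low bits of the long index.
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    public void set(long index, long value) {
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    public long length() { return length; }
}
```

With 2^31 chunks of 2^31 elements each you would indeed get to about 2^62 addressable slots, though heap size runs out long before the index space does.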

"Near the end insertion rate slowed down thanks to excessive GC activity"

Can someone explain why this happens?

The article doesn't say what -Xms was. If you set it lower than -Xmx, the VM will increase the heap size only when it's really necessary - it will issue a GC before trying to grow it. A GC on a heap of 100M objects takes a while.

Even when you reach -Xmx, the VM will try to GC, and sometimes it can free a bit of memory so it doesn't throw an OOM; then it inserts another 10 items and runs the GC again, and so on - as long as you allocate small objects, it can take a while until you see the OOM.

With each consecutive insert in the collection, the available memory shrinks, the GC detects this, and attempts to free some of it. As long as more memory (the 15GB) can be allocated, this isn't much of a problem. However, at some point there is hardly any memory left: the 15GB is all but allocated, and very little memory can be freed with each run. So the GC runs ever more often, to the point that it starts to slow down the system.
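The shrinking-headroom effect described above is easy to observe with a few lines of plain Java: Runtime.totalMemory() (the currently committed heap) creeps up toward Runtime.maxMemory() (the -Xmx cap) as the map fills. This is just an illustrative sketch, not code from the article.

```java
import java.util.HashMap;
import java.util.Map;

public class HeapWatch {
    // Fill a map with n entries, printing heap stats along the way.
    static Map<Integer, String> fill(int n) {
        Runtime rt = Runtime.getRuntime();
        Map<Integer, String> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(i, "value-" + i);
            if (i % 250_000 == 0) {
                long usedMb = (rt.totalMemory() - rt.freeMemory()) >> 20;
                System.out.printf("inserted=%d used=%dMB committed=%dMB max=%dMB%n",
                        i, usedMb, rt.totalMemory() >> 20, rt.maxMemory() >> 20);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        fill(1_000_000);
    }
}
```

Run it with, say, -Xms64m -Xmx256m and you can watch the committed size grow in steps, each step preceded by a collection.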

Before you say "well that's stupid", realise that the alternative would be throwing an OutOfMemoryError, in which case the JVM would terminate. A slow JVM is better than no JVM.

I disagree. A slow JVM can go on and on, and the perception may be that the system is up. I think in this scenario failing fast is much more desirable, since in a normal system the fact that the JVM is down would trigger some manual intervention, which seems to be needed anyway.

I might like the JVM to fail fast rather than plod along in a bogged-down state, but even then I wouldn't want it to crash outright (throwing an OutOfMemoryError); I'd rather have my monitoring system start to throttle the work it's doing and get other processes to help out (scaling).

Indeed, you can do this; the JVM has simple facilities to report the memory use and heap status, which you can monitor and use to proactively scale around the bogged-down VM.
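Those facilities live in the standard java.lang.management package. A minimal sketch of the in-process side of such a watchdog (the threshold and the method name are my own, purely illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapMonitor {
    /** Fraction of the max heap currently used (0.0 - 1.0). */
    static double heapUsedFraction() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        // getMax() is the -Xmx ceiling; for the heap it is normally defined.
        return (double) heap.getUsed() / heap.getMax();
    }

    public static void main(String[] args) {
        double f = heapUsedFraction();
        System.out.printf("heap used: %.1f%%%n", f * 100);
        if (f > 0.9) {
            // Hypothetical reaction: shed load, stop accepting work, alert, etc.
            System.out.println("heap nearly full - time to shed load");
        }
    }
}
```

The same numbers are exposed over JMX, so an external monitor can poll them without touching application code.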

You can get such functionality in Oracle's JVM by using the command line option -XX:+UseGCOverheadLimit.

This causes an OutOfMemoryError to be thrown when at least 98% of program time is spent in garbage collection and at most 2% of total memory is reclaimed in those collections.

Java GC is one of the finest pieces of engineering. It can be configured and tweaked in a million ways to provide the behaviour you want to see.


The behavior you want, I think, kind of depends.

Regardless, since Java 6 (I think), if the GC time becomes excessive (where "excessive" means x% of time is spent in GC and less than y% of memory is reclaimed in each GC cycle), an OutOfMemoryError will be thrown. This can be disabled.

Failure as default is usually not a good policy for a runtime system.

Failure as a default is how Erlang manages to keep systems up with 5+ 9's reliability. The difference is that, with Erlang, the entire system and the applications built on it are designed around that.

When there is high memory pressure, the default behavior in Erlang is to fail rather than continue garbage-collection?

The Java GC uses a mark-and-sweep algorithm that has to stop the world to collect garbage. As the number of items on the heap goes up, there is a higher number of live pointers (less garbage), and the GC takes a long time to find even a little garbage to clean up. So each GC run takes longer, and there are more of them as the heap fills up. Given that the world is stopped during each run, things take longer and hence the insertion rate is slower.

Java's garbage collector has advanced from pure stop-the-world mark-and-sweep collection over the years. Java 7 introduced the G1 collector, which works concurrently with application threads to minimize GC pauses and collection time on multicore systems. If you're curious, check out these resources about how it works:

* http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-co...

* http://www.oracle.com/technetwork/java/javase/tech/g1-intro-...

Note also that the JRE ships with multiple garbage collectors that you can select manually, and that some of the collectors tune themselves to match the application.

What? In Java there are multiple GCs available: ConcMarkSweep, G1GC, ParallelGC, and I believe a couple more. All you have to do is set the right CLI flag.
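You can even ask a running JVM which collectors it ended up with, via the standard GarbageCollectorMXBean API; a small sketch (nothing here is from the article):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class GcInfo {
    static List<GarbageCollectorMXBean> collectors() {
        return ManagementFactory.getGarbageCollectorMXBeans();
    }

    public static void main(String[] args) {
        // Prints one line per active collector, e.g. the young- and
        // old-generation collectors selected by flags like -XX:+UseG1GC.
        for (GarbageCollectorMXBean gc : collectors()) {
            System.out.printf("%s: collections=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Handy for confirming that the flag you set actually took effect.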

From the concurrent mark and sweep docs: "The concurrent mark sweep collector, also known as the concurrent collector or CMS, is targeted at applications that are sensitive to garbage collection pauses. It performs most garbage collection activity concurrently, i.e., while the application threads are running, to keep garbage collection-induced pauses short. The key performance enhancements made to the CMS collector in JDK 6 are outlined below. See the documents referenced below for more detailed information on these changes, the CMS collector, and garbage collection in HotSpot."

It does have to stop the world sometimes but those times should be quite rare!

If the CMS collector can't 'keep up' - meaning it doesn't think it can avoid running out of heap space given its duty cycle and the current memory state - it will do a complete, stop-the-world GC. Based on the article, I'm guessing this is what is kicking in.

They never mentioned using CMS, and CMS isn't the default GC, so I assume they weren't using it.

CMS actually isn't the highest throughput GC, it's there for low-pause systems where there are extra cores available to perform GC in the background.

My point was that none of the GCs completely avoid full-on, stop-everything, sweep-and-compact-all-the-things GC cycles. They will all do this if they think the heap is about to be exhausted (specifically, when a concurrent mode failure occurs). There really isn't anything else a GC can do in this situation other than throw an OOM error. Thus any time you're pushing the heap to its limits, regardless of which GC you use, you will see complete GC sweeps of the old generation.

If you want stricter failure requirements, use GCTimeLimit and GCHeapFreeLimit. By default the JVM will throw an OOM error if it spends 98% of execution time in the GC and frees only 2% of the memory; the GCTimeLimit and GCHeapFreeLimit switches allow these values to be changed. Also, only the 'stop the world' portions of a collection count toward the execution time limit; concurrent phases do not. So if you managed to keep the collector in concurrent mode (by cranking up the duty cycle, for example), only the two mini-pauses would count, even if the concurrent portion were pegging a CPU core at 100%.

Exactly. If there are too many live pointers, GC becomes very slow.

It is kind of a dumb test to do 'new HashMap' with the default initialCapacity and then cram stuff into it in a tight loop.

That is basically testing constant rehashes, which is obviously going to generate a ton of garbage.

I suspect the default collection implementations would perform significantly better if they were used correctly.
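For reference, sizing the map up front looks like this (a sketch, not the article's benchmark): HashMap resizes once its size exceeds capacity * loadFactor (0.75 by default), so dividing the expected size by the load factor guarantees no rehash ever happens.

```java
import java.util.HashMap;
import java.util.Map;

public class Presized {
    // Build a map whose capacity is large enough that no resize occurs.
    static Map<Integer, Integer> build(int expected) {
        Map<Integer, Integer> map = new HashMap<>((int) (expected / 0.75f) + 1);
        for (int i = 0; i < expected; i++) {
            map.put(i, i * 2);
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(build(1_000_000).size());
    }
}
```

Presizing avoids the repeated table reallocations, but note it does nothing about the per-entry node objects, which are most of what the GC has to trace in this benchmark.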

They still suck even if you set a large initialCapacity.

I expect that as memory usage increases, the resulting increase in GC pressure causes the GC to run more often. Most GCs that I know of are designed to have some memory free for the young generation (for fast, cheap allocations) and some memory free in the old generation (so objects can be moved from the young generation to the old generation). Maybe the JVM detects that it's approaching the limits of physical memory and starts GCing more often to avoid running out of space?

EDIT: From rereading the original post it seems more likely that it was approaching the heap size limit set when running the JVM and not a physical memory limit. I would expect the same principle applies, though.

Does anyone know which license MapDB is distributed under (GPL, BSD, etc.)? I couldn't find this information on the MapDB site.

Interesting, though I'm curious about what kind of results you get when something other than an empty String object is used?

It uses more memory, and the resulting number of entries will be smaller. It is a database engine; the laws of physics still apply.

I was more interested in knowing whether you have any real world examples of MapDB?

MapDB is still too fresh, so there is no documentation or examples yet. But some people are already using it, mostly for parsing large text files.

Previous versions were called JDBM, and it has been around since 2001, so it has a solid user base. The problem is that nobody tells me anything until there is a bug, which does not happen that often :-)

I'll say for JDBM3, the bugs must be few and far between; we haven't found any yet. Many thanks for a rock-solid library!

Does anyone know of a .net implementation?
