Not a huge company like DO, but iwStack provides a SAN-backed cloud, with selectable KVM/XEN instances, custom ISO and virtual network support. The prices are similar to DO.
I think one of the reasons is that they only have a small number of datacenters, have a small number of (very friendly) staff and are not going after the mass market like DO, and I guess they don't spend anything on marketing.
That's what AWS's "Elastic Block Store" (EBS) is. You can turn it off and just use instance storage (and I personally prefer to, for truly ephemeral nodes), but it increases spawn time since your disk image actually has to get copied over to the VM host machine in that case, rather than just "attached" over EBS.
Just because it's a SAN doesn't mean a given abstract block device from it is backed by RAID. It's literally just a multiplexed and QoSed network-attached storage cluster.
I actually prefer the lower-level abstraction: if you want a lower failure rate (or higher speed), you can RAID together attached EBS volumes yourself on the client side and work with the resultant logical volume.
On AWS, an EBS volume is only usable from one availability zone. You still need to use application-level replication to get geographic redundancy for important data, and when you have that, EBS just lets you be lazy rather than eager about copying a snapshot to local instances.
I guess I was thinking in terms of using EBS for ephemeral business-tier nodes, rather than as the backing store of your custom database-tier. (I usually use AWS's RDS Postgres for my database.)
For ephemeral business-tier nodes, EBS gives you a few advantages, but none of them are that astounding:
• the ability to "scale hot" by "pausing" (i.e. powering off) the instances you aren't using rather than terminating them, then un-pausing them when you need them again;
• the ability for EC2 to move your instances between VM hosts when Xen maintenance needs to be done, rather than forcibly terminating them. (Which only really matters if you've got circuit-switched connections without auto-reconnect—the same kind of systems where you'd be forced into doing e.g. Erlang hot-upgrades.)
• the ability to RAID0 EBS volumes together to get more IOPS, unlike instance storage. (But that isn't an inherent property of EBS being network-attached; it's just a property of EBS providing bus bandwidth that scales with the number of volumes attached, where the instance storage is just regular logical volumes that all probably sit on the same local VM host disk. A different host could get the same effect by allocating users isolated local physical disks per instance, such that attaching two volumes gives you two real PVs to RAID.)
• the ability to quickly attach and detach volumes containing large datasets, allowing you to zero-copy "pass" a data set between instances. Anything that can be done with Docker "data volumes" can be done with EBS volumes too. You can create a processing pipeline where each stage is represented as a pre-made AMI, where each VM is spawned in turn with the same "working state" EBS volume attached; modifies it; and then terminates. Alternately, you can have an EC2 instance that attaches, modifies, and detaches a thousand EBS volumes in turn. (I think this is how Amazon expected people would use AWS originally—the AMI+EBS abstractions, as designed, are extremely amenable to being used in the way most people use Docker images and data-volumes. The "AMI marketplace" makes perfect sense when you imagine Docker images in place of AMIs, too. Amazon just didn't consider that the cost for running complete OS VMs, and storing complete OS boot volumes, might be too high to facilitate that approach very well. Unikernels might bring this back, though.)
I think people should just use what they're comfortable with and what is the best fit for the problem.
My perspective on web development isn't public websites, but web servers running on embedded devices and/or workstations.
My issue is that the JVM forces me to marry it. I can never make a C-ABI library with it. However, I also need to run the same code in places where the JVM is not available or not an option, and that limits the JVM's usefulness to me.
Another issue is that it's much harder (and often impossible) to extract maximum performance out of the JVM. It's true the JVM is as fast as naive C/C++ code and much safer too. There are things you can do to slightly improve performance, but it often means the code is far from idiomatic Java (or whatever language it is that compiles to bytecode).
I like how JVM land can dynamically optimize on the fly according to runtime information. How it can inline at runtime across different .jar "libraries".
But in C-ABI land I can ensure memory locality of reference, directly map objects to memory mapped file, and use SIMD and other processor features for maximal effect. I can deal with multi-socket NUMA issues and allocate CPU local memory. There are cases where SIMD gives 10-40x boost (like sorting code) and a ton of cases where you get 2-10x gain.
I do hate the days when some piece of code poops all over stack and heap and all you have is a corrupted core dump to work with.
Of course I can use JNI, but then I have two problems.
The JVM banana is tasty, but the gorilla and the jungle come with it.
Rust apple looks promising and so far it seems like I get only the apple I need. On top of that it looks like in addition to memory safety, I'm also getting compile time concurrency guarantees. That's something JVM/Java never gave me.
If one day Rust could provide me dynamic runtime optimization features, that might be all I'll need.
JVM's gc is most likely significantly better. On the other hand, golang's gc needs to collect fewer objects, in some cases orders of magnitude fewer.
If you compare a slice of structs with 1000 elements, it'll be one object (and one allocation) in golang. The equivalent array in the JVM requires the array itself + 1000 Objects, so 1001 allocations. In this case, golang has a much smaller object graph to gc.
Of course, a slice of 1000 interfaces or pointers faces the same 1001 issue in golang as well.
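To make the 1 vs 1001 point a bit more concrete, here is a rough Go sketch that counts heap allocations for both layouts. The `point` type and the sink variables are made up for the illustration, and it leans on `testing.AllocsPerRun` from the standard library to do the counting. The counts should mirror the ones described above, but the exact numbers depend on the compiler's escape analysis, so treat it as a measurement to run rather than a guarantee.

```go
package main

import (
	"fmt"
	"testing"
)

// point is a hypothetical 24-byte value type used only for this illustration.
type point struct{ x, y, z int64 }

// Package-level sinks force the slices to escape to the heap, so the
// compiler cannot keep them on the stack and hide the allocations.
var (
	structSink  []point
	pointerSink []*point
)

// structSliceAllocs reports heap allocations per run when 1000 structs
// are stored inline in a single slice.
func structSliceAllocs() float64 {
	return testing.AllocsPerRun(100, func() {
		structSink = make([]point, 1000)
	})
}

// pointerSliceAllocs reports heap allocations per run when the slice
// holds 1000 pointers, each to its own heap object.
func pointerSliceAllocs() float64 {
	return testing.AllocsPerRun(100, func() {
		s := make([]*point, 1000)
		for i := range s {
			s[i] = &point{x: int64(i)}
		}
		pointerSink = s
	})
}

func main() {
	fmt.Println("slice of structs: ", structSliceAllocs())
	fmt.Println("slice of pointers:", pointerSliceAllocs())
}
```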
You could emulate the same gc load in the JVM, at the cost of runtime performance, by storing the objects in a byte array and [de]serializing as needed, but that's neither an idiomatic nor an acceptable solution most of the time.
Although Go's GC is tunable to some extent, the open source HotSpot JVM already has multiple GC implementations that you can choose from based on your use case and tune further. There is also work being done in the OpenJDK project on a GC that can collect > 100GB heaps in < 10ms. There are also alternative proprietary implementations available today that already have no stop-the-world collections.
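For reference, the "tunable to some extent" part on the Go side is mostly one knob: the GOGC percentage, which can also be set at runtime through `debug.SetGCPercent`. A minimal sketch, assuming you want to relax the collector around a batch-style section and put it back afterwards (the helper names `withGCPercent` and `currentGCPercent` are mine, not part of any library):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// currentGCPercent reads the current setting. This sketch uses
// set-and-restore instead of a getter, so it briefly changes the value.
func currentGCPercent() int {
	p := debug.SetGCPercent(100)
	debug.SetGCPercent(p)
	return p
}

// withGCPercent runs fn with the GC target set to pct and restores the
// previous value afterwards. A higher percentage lets the heap grow more
// between collections: fewer collections, more memory used.
func withGCPercent(pct int, fn func()) (previous int) {
	previous = debug.SetGCPercent(pct)
	defer debug.SetGCPercent(previous)
	fn()
	return previous
}

func main() {
	fmt.Println("before:", currentGCPercent())
	withGCPercent(400, func() {
		fmt.Println("inside batch section:", currentGCPercent())
	})
	fmt.Println("after:", currentGCPercent())
}
```

Compared with the JVM's flag list this is tiny, which is pretty much the point being argued in this thread.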
If it is carefully tuned, why does it need such a big GC tuning guide and 100s of JVM flags to tune the runtime? Any Java product of consequence comes with custom GC settings, meaning they do not find the default ones suitable.
Because the JVM developers had customers who asked for the ability to tune the GC for their particular application.
Go will receive those feature requests too. The Go developers may not be as willing to provide so many knobs (which is a position I'm entirely sympathetic to, don't get me wrong). But the settings always exist, regardless of whether Google hammers in values for them or leaves them adjustable. GC is full of tradeoffs; they're fundamental to the problem.
The G1 collector has a single knob that is supposed to be a master knob: you pick your pause time goal. Lower means shorter pauses but overall more CPU time spent on collection. Higher means longer pauses but less time spent on collection and thus more CPU time spent on your app. Batch job? Give it a high goal. Latency sensitive game or server? Give it a low goal.
There are many other flags too, and you can tune them if you want to squeeze more performance out of your system, but you don't have to use them if you don't want to.
Depends what you compare it to. As I have written above, you can get low pause times with huge heaps today. In practice very few apps need such low pause times with such giant heaps and as such most users prefer to tolerate higher pauses to get more throughput. There are cases where that's not true, the high frequency trading world seems to be one, but that's why companies like Azul make money. You can get JVMs that never pause. Just not for free.
With respect to garbage collection only and ignoring things like reliable debugging support, the primary thing it does is compaction.
If your memory manager does not compact the heap (i.e. never moves anything), then this implies a couple of things:
1. You can run out of memory whilst still technically having enough bytes available for a requested allocation, if those bytes are not contiguous. Most allocators bucket allocations by size to try and avoid the worst of this, but ultimately if you don't move things around it can always bite you.
2. The allocator has to go find a space for something when you request space. As the heap gets more and more fragmented this can slow down. If your collector is able to move objects then you can do things generationally which means allocation is effectively free (just bump a pointer).
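The "just bump a pointer" part of point 2 can be sketched in a few lines of Go. This is only an illustration of the allocation fast path, with a made-up `bumpArena` type; a real generational collector also has to evacuate survivors before it can reset the region, which is exactly the moving part being discussed.

```go
package main

import "fmt"

// bumpArena is a hypothetical fixed-size region where allocation is just
// an offset increment. There is no free list to search.
type bumpArena struct {
	buf  []byte
	next int // offset of the first unused byte
}

func newBumpArena(size int) *bumpArena {
	return &bumpArena{buf: make([]byte, size)}
}

// alloc hands out the next n bytes, or nil if the region is exhausted.
func (a *bumpArena) alloc(n int) []byte {
	if n < 0 || a.next+n > len(a.buf) {
		return nil
	}
	p := a.buf[a.next : a.next+n : a.next+n]
	a.next += n
	return p
}

// reset reclaims everything at once. This is only safe when nothing live
// remains in the region, e.g. after survivors were copied elsewhere.
func (a *bumpArena) reset() { a.next = 0 }

func main() {
	a := newBumpArena(16)
	fmt.Println(len(a.alloc(10)), a.alloc(10) == nil)
	a.reset()
	fmt.Println(len(a.alloc(16)))
}
```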
In the JVM world there are two state of the art collectors, the open source G1 and Azul's commercial C4 collector (C4 == continuous compacting concurrent collector). Both can compact the heap concurrently. It is considered an important feature for reliability because otherwise big programs can get into a state where they can't stay up forever because eventually their heap gets so fragmented that they have to restart. Note that not all programs suffer from this. It depends a lot on how a program uses memory, the types of allocations they do, their predictability, etc. But if your program does start to suffer from heap fragmentation then oh boy, is it ever painful to fix.
The Go team have made a collector that does not move things. This means it can superficially look very good compared to other runtimes, but it's comparing apples to oranges: the collectors aren't doing the same amount of work.
The two JVM collectors have a few other tricks up their sleeves. G1 can deduplicate strings on the heap. If you have a string like "GET" or "index.html" 1000 times in your heap, G1 can rewrite the pointers so there's only a single copy instead. C4's stand-out feature is that your app doesn't pause for GC ever, all collection is done whilst the app is running, and Azul's custom JVM is tuned to keep all other pause times absolutely minimal as well. However IIRC it needs some kernel patches in order to do this, due to the unique stresses it places on the Linux VMM subsystem.
While I agree that compaction is desirable in theory, empirically it's not really necessary. For example, there are no C/C++ malloc/free implementations that compact, because compaction would change the address of pointers, breaking the C language. Long-lived C and C++ applications seem to get by just fine without the ability to move objects in memory.
Java code also tends to make more allocations than Go code, simply because Java does not (yet) have value types, and Go does. This isn't really anything to do with the GC, but it does mean that Java _needs_ a more powerful GC just to handle the sometimes much greater volume of allocations. It also makes Java programmers sometimes have to resort to hacks like arrays of primitive types (I've done this before).
People like to talk about how important generational GC is, and how big a problem it is that Go doesn't have it. But I have also seen that if there is too high a volume of data in the young-gen in Java, short-lived objects get tenured anyway. In practice, the generational assumption isn't always true. If you use libraries like Protocol Buffers that create a ton of garbage, you can pretty easily exceed the GC's ability to keep up with short-lived garbage.
I'm really curious to see how Go's GC works out for big heaps in practice. I can say that my experience with Java heaps above 100 GB has not been good. (To be fair, most of my Java experience has been under CMS, not the new G1 collector.)
Experience with C/C++ is exactly why people tend to value compaction. I've absolutely encountered servers and other long-lived apps written in C++ that suffer from heap fragmentation, and required serious attention from skilled developers to try and fix things (sometimes by adding or removing fields from structures). It can be a huge time sink because the code isn't actually buggy and the problem is often not easily localised to one section of code. It's not common that you encounter big firefighting efforts though, because often for a server it's easier to just restart it in this sort of situation.
As an example, Windows has a special malloc called the "low fragmentation heap" specifically to help fight this kind of problem - if fragmentation was never an issue in practice, such a feature would not exist.
CMS was never designed for 100GB+ heaps so I am not surprised your experience was poor. G1 can handle such heaps although the Intel/HBase presentation suggested aiming for more like 100msec pause times is reasonable there.
The main thing I'd guess you have to watch out for with huge Go heaps is how long it takes to complete a collection. If it's really scanning the entire heap in each collection then I'd guess you can outrun the GC quite easily if your allocation rate is high.
It's true heaps can be a pain with C/C++. 64-bit is pretty ok, it's rare to have any issues.
32-bit is painful and messy. If possible, one thing that may help is to allocate large (virtual memory wise) objects once at the beginning of a new process and have separate heaps for different threads / purposes. Not only heap fragmentation but also virtual memory fragmentation can be an issue. The latter is usually what turns out to be fatal. One way to mitigate issues with multiple large allocations is to change memory mappings as needed... Yeah, it can get messy.
64-bit systems are way easier. Large allocations can be handled by allocating page size blocks of memory from OS (VirtualAlloc / mmap). OS can move and compact physical memory just fine. At most you'll end up with holes in the virtual memory mappings, but it's not a real issue with 64 bit systems.
Small allocations can be handled by an allocator that is smart enough to group allocations by 2^n size (or do some other smarter tricks to practically eliminate fragmentation).
Other ways are to use arenas or multiple heaps. For example per thread or per object.
There are also compactible heaps. You just need to lock the memory object before use to get a pointer to it and unlock when you're done. The heap manager is free to move the memory block as it pleases, because no one is allowed to have a pointer to the block. Harder to use, yes, but hey, no fragmentation!
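A toy version of such a lock/unlock heap, in Go, to show why handles make compaction possible. Everything here (`handleHeap`, its methods, the "refuse to compact while anything is locked" policy) is an assumption made for the sketch, not how any particular heap manager does it.

```go
package main

import "fmt"

// block records where one allocation currently lives inside the buffer.
type block struct {
	off, size int
	live      bool
	locked    bool
}

// handleHeap is a hypothetical compactible heap: callers hold integer
// handles, never raw pointers, so unlocked blocks can be moved freely.
type handleHeap struct {
	buf    []byte
	top    int     // first byte past the last allocated block
	blocks []block // indexed by handle, live blocks in ascending offset order
}

func newHandleHeap(size int) *handleHeap {
	return &handleHeap{buf: make([]byte, size)}
}

// alloc returns a handle for size bytes, compacting first if the tail is full.
func (h *handleHeap) alloc(size int) (handle int, ok bool) {
	if h.top+size > len(h.buf) {
		h.compact()
	}
	if h.top+size > len(h.buf) {
		return 0, false
	}
	h.blocks = append(h.blocks, block{off: h.top, size: size, live: true})
	h.top += size
	return len(h.blocks) - 1, true
}

func (h *handleHeap) free(handle int) { h.blocks[handle].live = false }

// lock pins the block and exposes its bytes. The slice must not be used
// after unlock, because the block may move.
func (h *handleHeap) lock(handle int) []byte {
	b := &h.blocks[handle]
	b.locked = true
	return h.buf[b.off : b.off+b.size]
}

func (h *handleHeap) unlock(handle int) { h.blocks[handle].locked = false }

// compact slides live blocks toward offset 0, closing holes left by freed
// blocks. It gives up if any live block is locked.
func (h *handleHeap) compact() {
	for _, b := range h.blocks {
		if b.live && b.locked {
			return
		}
	}
	top := 0
	for i := range h.blocks {
		b := &h.blocks[i]
		if !b.live {
			continue
		}
		copy(h.buf[top:top+b.size], h.buf[b.off:b.off+b.size])
		b.off = top
		top += b.size
	}
	h.top = top
}

func main() {
	h := newHandleHeap(8)
	a, _ := h.alloc(4)
	b, _ := h.alloc(4)
	copy(h.lock(b), "data")
	h.unlock(b)
	h.free(a)
	_, ok := h.alloc(4) // only fits because compaction moves b down
	fmt.Println(ok, string(h.lock(b)))
	h.unlock(b)
}
```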
Yeah, Java is better in some ways for being able to compact memory always. That said, I've also cursed it to hell for ending up in practically infinite gc loop when used memory is nearing maximum heap size.
Well, I can only say that your experience is different than mine. I worked with C++ for 10 years, on mostly server side software, and never encountered a problem that we traced back to heap fragmentation. I'm not sure exactly why this was the case... perhaps the use of object pools prevented it, or perhaps it just isn't that big of a problem on modern 64 bit servers.
At Cloudera, we still mostly use CMS because the version of G1 shipped in JDK6 wasn't considered mature, and we only recently upgraded to JDK7. We are currently looking into defaulting to G1, but it will take time to feel confident about that. G1 is not a silver bullet anyway. You can still get multi-minute pauses with heaps bigger than 100GB. A stop-the-world GC is still lurking in wait if certain conditions are met, and some workloads always trigger it... like starting the HDFS NameNode.
"Great" is pushing it a bit. The JVM will not do inter-procedural escape analysis unless the called method is inlined into the caller, so that the compiler can treat it as a single method for optimisation purposes. So forget about stack allocating an object high up the call stack even if it's only used lower down and could theoretically be stack allocated.
That said, JVMs do not actually stack allocate anything. They do a smarter optimisation called scalar replacement. The object is effectively decomposed into local variables that are then subject to further optimisation, for instance, completely deleting a field that isn't used.
Value types will be added to the JVM eventually in the Valhalla project. Go fans may note here that Go has value types, but this is a dodge - the bulk of the work being done so far in Valhalla is a major upgrade of the support for generics, because the Java (and .NET) teams believe that value types without generic specialisation is a fairly useless feature. If they didn't do that you could have MyValueType as an array, but not a List<MyValueType> or Map<String, MyValueType> which would make it fairly useless. Go gets around this problem by simply not letting users define their own generic data structures and baking a few simple ones into the language itself. This is hardly a solution.
It is straightforward because Go has first-class value types, which are most likely to be on the stack, vs Java, where everything except primitives is a reference type and most likely to be on the heap. Also, Java data structures are really bloated.
The golang object pool is a bit of a problem (compared to the JVM alternatives) due to lack of generics. You tend to need to do object pooling when you have tight performance requirements which is at odds with the type manipulation you have to do with the sync.Pool.
So the golang pool is good for the case where you have GC heavy but non-latency sensitive operations, but not the more general performance sensitive problems.
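For anyone who hasn't used it, this is roughly what the type manipulation looks like in practice. A minimal sketch, with a made-up `render` function and a pool of `*bytes.Buffer`; the point is only that `Get` returns an `interface{}`, so every call site needs a dynamic type assertion:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers. New runs when the pool is empty.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render is a hypothetical hot-path function that borrows a buffer.
func render(name string) string {
	// Get returns interface{}, so the concrete type is checked at runtime here.
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a recycled buffer may still hold old contents
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so returning it before Put is fine
}

func main() {
	fmt.Println(render("gopher"))
	fmt.Println(render("again"))
}
```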
At the x86 level, myPool.Get(Index) is going to be at least as expensive as cmp/jae/mov (3 cycles), and myPool.Get().(myStruct) is going to be at least as expensive as cmp/jae/cmp/jne/mov (5 cycles). So unless you have some way of hiding the latency, the type check is 67% slower by cycle count.
The experience of every JIT developer is that dynamic type checks do matter a lot in hot paths.
Not disagreeing, but I think that was a bit inaccurate.
If that branch is mispredicted, we're talking about 12-20 cycles. Ok, I assume it's a range check and thus (nearly) always not taken. So if it's in hot path, it'll always be correctly predicted. Modern CPUs will most likely fuse cmp+jae into one micro-op, so predicted-not-taken + mov will take 2 cycles (+latency).
"cmp/jae/cmp/jne/mov" will of course be fused into 3 micro-ops. But don't you mean "cmp/jae/cmp/je/mov"? I'm assuming second compare is a NULL check (or at least that instructions are ordered that way second branch is practically never taken). I think that also takes 2 cycles (both branches execute on same clock cycle + mov), but not sure how fused predicted-not-takens behave.
L3 miss for that mov, well... might well be 200 cycles.
Ah yeah, I wasn't sure if fusion was going to happen. You're probably right in macro-op terms; sorry about that.
The first compare is a bounds check against the array backing the pool, and the second compare is against the type field on the interface, not a null check. Golang interfaces are "fat pointers" with two words: a data pointer and a vtable pointer. So the first cmp is against a register, while the second cmp is against memory, data dependent on the register index. The address of the cmp has to be at least checked to determine if it faults, so I would think at least some part of it would have to be serialized after the first branch, making it slower than the version without the type guard.
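The two-word layout is easy to check from Go itself. A small sketch, assuming the standard gc toolchain (the helper names are mine): `unsafe.Sizeof` on an interface value reports two machine words, and the comma-ok assertion is the runtime type check being costed above.

```go
package main

import (
	"fmt"
	"unsafe"
)

// interfaceWords reports how many machine words an interface value takes.
func interfaceWords() uintptr {
	var i interface{} = 42
	return unsafe.Sizeof(i) / unsafe.Sizeof(uintptr(0))
}

// asInt performs the dynamic type check: it compares the type word of
// the interface against int before handing back the data.
func asInt(v interface{}) (int, bool) {
	n, ok := v.(int)
	return n, ok
}

func main() {
	fmt.Println("words per interface value:", interfaceWords())
	fmt.Println(asInt(7))
	fmt.Println(asInt("seven"))
}
```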
So your hand-rolled one will not have the typing overhead we are discussing, but it will have 2 much worse issues.
1. sync.Pools have thread-local storage, something your own pools will not have.
2. sync.Pools are GC aware, meaning that if the allocator is having trouble it can drain "free" pool objects to gain memory. Your custom pool will not have this integration.
I have a feeling that the performance you gain by not type-checking you will lose by not having #1.
I think you are missing my point, so I'll restate it. sync.Pool does not help with GC issues compared to the JVM, because the JVM also has object pools. Further, those object pools are actually better for the low latency case, because the language does not force them to make a choice between dynamic type checks and specific-use abstractions.
As pcwalton points out, my whole argument is actually null and void due to type erasure...doh.
Yeah, and they can collect the unreachable objects in the array when no references exist to the array itself.
From the QCon talk linked to in the slides it sounds like the Go GC is benefiting from the reduced number of objects being allocated and the fact that those objects can never move. Makes things a lot simpler if you can get away with it, but I can imagine a reference into an array, and the inability to move objects could combine in bad ways if you're unlucky.
> So unless you did profile, your comment didn't really add anything.
The previous comment explained why and how the author's analysis is flawed. Then it goes over the top by speculating in exactly the opposite direction. But that doesn't reduce the quality of the first part of the comment. That part was still valuable to me.
The original author's analysis is flawed: it doesn't extend to multi-socket systems.
What we don't know is how this scales beyond a single socket. The graph is not going to tell us anything about that. Profiling will.
Experience has shown me that assumptions are bad.
So now I don't assume a certain call will succeed or even work the way I think it does. Instead I check return value of every call and test it against my assumptions.
I never assume much about performance either. I might occasionally use microbenchmarks as a hint. But the main mode of operation is measuring as big of a piece of functionality as possible. Many different size, but realistic, workloads. Preferably on multiple different systems as well.
If performance is the goal, I'd advocate trying out different concurrent hash map implementations in the system one is building.
> The graph is not going to tell us anything about that. Profiling will.
What kind of profiling would you employ for a concurrent data structure like the hash map here? Instrumented code or sampling profilers?
I'm afraid both kinds of profiling would yield quite meaningless information, because the individual operations are quite fast and the runtime may vary depending on e.g. cache utilization and contention. Profiling is good for bigger applications but it's not super useful for "primitive" operations like hash map inserts.
If I were to optimize something like this, I'd first reach for the CPU performance counters to try to understand which aspect is the bottleneck.
A sampling profiler would be helpful. Although most hash map related samples would likely fall on atomic ops it presumably uses for synchronization. On the other hand, you'd know whether you need to optimize this in the first place.
An instrumented profiler would yield garbage data for a lot of reasons. I wouldn't use one, except maybe over a large group of hash map operations.
> ... I'd first reach for the CPU performance counters
The overhead of std::mutex depends a lot on contention patterns. Assuming one x86 CPU socket, you can usually take + release an ideally optimized contended mutex about 10M-25M times per second, if that's all you do, at 100% CPU usage on all cores. That's simply limited by how many times you can run contended LOCK XADD (x86 atomic fetch-and-add) / LOCK CMPXCHG* (x86 atomic compare-and-swap) instructions per second.
A mutex generates a lot of cache coherency traffic and saturates the CPU-internal ring bus. In the NUMA case it'll of course quickly saturate the lower-bandwidth CPU-external QPI link(s).
Contended mutexes perform a lot better on a typical laptop than on a server.
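If you want to see what this looks like on your own hardware, a worst-case contention microbenchmark is easy to put together. The sketch below is in Go, so it exercises `sync.Mutex` and `atomic.AddInt64` rather than std::mutex, and the worker and iteration counts are arbitrary picks of mine. Treat the timings as a hint only; they will differ a lot between a laptop and a multi-socket server.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// countWithMutex has every worker fight over one mutex-protected counter.
func countWithMutex(workers, iters int) int64 {
	var mu sync.Mutex
	var n int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				mu.Lock()
				n++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return n
}

// countWithAtomic does the same work with a single atomic fetch-and-add
// per increment, so the only cost is the contended cache line.
func countWithAtomic(workers, iters int) int64 {
	var n int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				atomic.AddInt64(&n, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&n)
}

func main() {
	const workers, iters = 8, 200000

	start := time.Now()
	total := countWithMutex(workers, iters)
	fmt.Println("mutex: ", total, "increments in", time.Since(start))

	start = time.Now()
	total = countWithAtomic(workers, iters)
	fmt.Println("atomic:", total, "increments in", time.Since(start))
}
```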
OK, now I see what you are saying. It would certainly be true for a spin lock, but it seems to me that, for std::mutex, the kernel transition required to arbitrate contention would dominate even over the coherency traffic, right?
You're right that context switches would dominate std::mutex. For some reason I was thinking std::mutex uses spinlocks, even though I knew better. Too much kernel side coding lately - context switches are often not possible there, so mutexes are often not possible as potential synchronization mechanisms.
You might still be right. The implementation of std::mutex is unspecified, and real world implementations will probably spin a bit before going into the kernel, so for your test case with a tiny critical section it might behave like a spinlock.
It can be hard or even impossible to prove that a variable gets read before it gets written to.
People who want to get every last bit of performance out of their systems do want to write code like:
if(condition1) x = expression1;
x = DoSomethingWith(x);
where the intelligent reader knows things are fine, but a compiler has a hard time proving it. For example, a contract implicit in an API may enforce the invariant, but that cannot be deduced from the code.
> Although, IMHO, code with a reference to block scope uninitialized variable shouldn't even compile.
Funny story, that is actually what caused Debian's weak keys fiasco. The code was using uninitialised memory to seed the PRNG and someone removed it because of valgrind warnings. So sometimes UB is exactly what you want. :P
In high level parallel programming, say for SIMD or GPU (which really is just a fancy name for a wide SIMD engine with a ton of hardware threads to hide memory latency and a 2D-oriented gather/scatter memory controller), I wish there was a way to still expose the situation when your data is in reality shared by multiple "cores", when they're actually just different lanes in the same SIMD/WARP/whatever unit. Right now you just pretend they're separate, even when they can't even branch individually.
Example: Think about an image processing algorithm, say a 3x3 kernel. If you have 16-wide SIMD, using 3 registers, you only need special handling for the first and last pixels, since only they cross the processing boundary. The 14 pixels in the middle already have all of the data. If you can plan your memory accesses, you can do even better by shifting a pixel in from each new 16 pixel fetch.
Through abstraction you need to gather the 9 (3x3 kernel) center+neighboring values and run clamping for each neighbor access, even if you're not at the boundary of the image data. Even when a good SIMD/GPU compiler notices the data is already loaded, you still end up with needless saturation arithmetic (clamp) and most likely extra memory loads, potentially many times as many.
Sure, swizzling for locality of reference and a good cache controller help, but extra ALU work and memory accesses still mean sub-optimal performance.
This is where I've gotten good gains (up to 2-5x) when manually writing SIMD asm or intrinsics.
That's a lot of performance to leave on the table, given that CPU and GPU performance hasn't improved much in the last 3 years.
This year we will get HBM1/2 + more ALUs to use up that bandwidth for GPUs.
Next year, AVX-512 for x86 CPUs. I guess HBM will also come to CPUs in a few years.
My guess is that after that, the next significant improvements will come maybe 2020-2025, if the current performance improvement trend holds.
Next step is probably CPUs absorbing GPU duties. So far memory bandwidth has been an issue, but now you can stick HBM chips just as well in a CPU package as in a GPU package. Just need some lower clocked ultra-wide cores.