Archive.is worked: http://archive.is/kKcRV
This was my first exposure to btrace, which is a super useful Swiss Army knife for JVM debugging. That made this a worthwhile adventure for sure.
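For anyone who hasn't seen it: a btrace script is just an annotated Java class. A minimal, hypothetical sketch (classic com.sun.btrace API; the class name and probe are mine, not from the article) that prints which classloader is asked to load which class:

    import com.sun.btrace.annotations.*;
    import static com.sun.btrace.BTraceUtils.*;

    @BTrace
    public class ClassLoadTracer {
        // Fires on ClassLoader.loadClass (the '+' prefix matches
        // subclasses too), printing the loader instance and the
        // requested class name.
        @OnMethod(clazz = "+java.lang.ClassLoader", method = "loadClass")
        public static void onLoadClass(@Self Object loader, String name) {
            println(str(loader) + " -> " + name);
        }
    }

You attach it to a running JVM by pid, no restart needed, which is exactly what makes it handy for this kind of classloader mystery.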
I'm curious how you guys operate the Flink cluster. Do you have a single huge shared Flink cluster where people can submit any kind of job for various applications/streams?
Or do you have multiple smaller Flink clusters for specific use cases?
Are they on Mesos/YARN/k8s, or just plain VMs/bare metal?
Which Flink version are you guys on?
I'm very excited about Flink 1.6, especially FLIP-6 [https://cwiki.apache.org/confluence/pages/viewpage.action?pa...]
Just wondering if you guys have any thoughts on that.
We are on 1.3.2 at the moment, running on EC2 VMs.
The fix turned out to be fairly involved too – on the order of a week I think.
Ivan works from Bulgaria so sadly he is asleep right now.
Usually, when a NoClassDefFoundError is thrown, the stack trace contains a "Caused by" section clearly showing the name of the class loader that was supposed to know about the class in question, and which then rather obviously failed to load it. If that actually isn't in the real trace, the exception/error logging is probably a bit wonky.
I have seen logging routines that don't traverse/print the entire cause chain, but only print the message and the immediate stack. This is unfortunately a rather terrible idea: quite often it excludes the actual cause from the printed trace while simultaneously retaining the error message, which frequently leads to liberal amounts of confusion. The pattern of exceptions being thrown with a cause is not uncommon in the JDK, so making sure to log causes is important.
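As a minimal sketch (the helper name is mine, not from any particular framework), a logging helper should walk the whole chain rather than stop at the top-level message:

    import java.util.StringJoiner;

    final class Errors {
        // Walk the entire cause chain so the root cause (e.g. the
        // ClassNotFoundException hiding behind a NoClassDefFoundError)
        // never gets dropped from the log line.
        static String describe(Throwable top) {
            StringJoiner out = new StringJoiner(" <- ");
            for (Throwable t = top; t != null; t = t.getCause()) {
                out.add(t.getClass().getName() + ": " + t.getMessage());
            }
            return out.toString();
        }
    }

(Throwable.printStackTrace and most logging frameworks already do this when handed the exception object itself; the problem tends to be hand-rolled code that logs only t.getMessage().)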
In general, although it's probably obvious, I would like to mention that having code loaded by class loaders with different lifetimes interact is rife with "interesting" issues. Wherever possible I would recommend serializing messages across any boundary where class loaders have different lifetimes, as it both prevents the strangest classes of errors and can lead to a cleaner design. An exception would be if that is prohibitively expensive from a performance perspective, of course.
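To make that concrete, here's a minimal sketch of the idea using plain java.io serialization (the class and method names are mine, not from Flink or any framework): serialize on one side, and deserialize against the receiving side's own classloader on the other.

    import java.io.*;

    final class Boundary {
        // Serialize on the sending side; no live classes cross the boundary.
        static byte[] toBytes(Serializable msg) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(msg);
            }
            return bos.toByteArray();
        }

        // Deserialize on the receiving side, resolving classes against the
        // receiver's classloader instead of whichever loader the sender used.
        static Object fromBytes(byte[] data, ClassLoader receiver)
                throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(data)) {
                @Override
                protected Class<?> resolveClass(ObjectStreamClass desc)
                        throws ClassNotFoundException {
                    return Class.forName(desc.getName(), false, receiver);
                }
            }) {
                return in.readObject();
            }
        }
    }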
Although in this case the cause was very different, it reminds me of an old "trap for young players" with loading shared libraries dynamically --- the library itself can exist and be readable and executable, and yet attempting to load it fails with a "file not found" error. This happens when one of its dependencies, directly or indirectly, is missing.
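The JVM has an analogue of this trap. A hypothetical sketch (library names invented):

    public class NativeLoadDemo {
        public static void main(String[] args) {
            // Suppose libfoo.so exists, and is readable and executable,
            // but it links against libbar.so, which is missing. The
            // dynamic linker's error bubbles up as an UnsatisfiedLinkError
            // that reads like "file not found" even though libfoo.so
            // itself is right there, typically something like:
            //   libbar.so: cannot open shared object file:
            //   No such file or directory
            System.loadLibrary("foo");
        }
    }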
In my case, we added a new class, it worked fine on dev, then failed on staging (and would have failed in production if we'd let it go that far). This was confusing, because we were using Vagrant to ensure our dev and staging environments were identical. What could be going on?
Well, our Linux VMs were being hosted on OS X, with shared folders for the code, and by default OS X volumes are not case-sensitive. Meanwhile, the Linux staging and production environments were using actual Linux filesystems, which are case-sensitive. So someone added a new class MyFancyClass, then tried to import it as MyFancyclass, and it worked great in dev since the underlying FS of the host OS could find the file, then failed on staging.
A fun debugging ride, and a good reminder that 1) having dev and staging the same is really important and 2) that might be harder than you think. :)
Actually, I think it proves that having staging and PROD the same is particularly important.
Yes, I know the details, no I never remember them when it happens to me ;)
What sort of process do you have for picking trendy technologies vs. tested ones, and how much do you talk to people who have built large-scale systems before adopting things like Scala?
This is an ecosystem problem and not something inherent in a program using an actor abstraction.
Java also has really great tooling... when it is used the way it wants to be used. I don't have enough Erlang experience to know whether the same is true there, but with Java I've found that smart but green devs (like it sounds like Heap has) tend to reach for systems that solve a lot of their initial problems and cover up their initial ignorance, but then trade that for operational problems down the road. More experienced developers tend to build systems that are boring and explicit, and take longer to get to beta, but don't require a pager.
drob mentioned that they aren't writing Flink jobs very often, which makes me think that they are probably using it for some sort of rollups / stream processing of their analytics data. If the business logic of those is complex, they'll probably have a bad time with outliers that fuck up their cardinality assumptions. If it isn't super complex, they probably didn't take the time to model their data correctly with boring Java classes, and keep pushing complexity into the interactions between their actors.
Either way, they would have a much faster and easier-to-maintain system if they forced themselves to pretend to be stuck with Java 1.6, except where 1.8 features improve performance and readability (basically, avoid abusing streams and reflection).
I don't recommend using any type of fat jar plugin (like OneJar) or even Google Guice for that matter. Custom class loaders are a nightmare.
Thanks to Docker containers, you should never really need a fat jar again. Just find a decent Docker packager for your build system (sbt, Gradle, etc.) and it can plop all your dependencies in there in a nice, isolated container that uses the standard class loader.
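For instance, a minimal sketch of the Dockerfile side (base image, paths, and main class are all illustrative, and it assumes the build tool has already collected the dependency jars into a lib/ directory):

    FROM eclipse-temurin:8-jre
    COPY build/libs/app.jar /opt/app/app.jar
    COPY build/libs/lib/ /opt/app/lib/
    # The java launcher expands the classpath wildcard itself, so no
    # shell, no fat-jar merging, and no custom classloader is involved.
    ENTRYPOINT ["java", "-cp", "/opt/app/app.jar:/opt/app/lib/*", "com.example.Main"]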
(Docker has its place, but it's massive overkill just to avoid fat jars.)
The problem seemed to be Flink's unloading of the fat jar's classloader on error. That would have happened with slim jars as well, wouldn't it?
I also don't see how Docker relates exactly; you can have hundreds of library jars on a classpath with standard classloaders, no Docker required.
* If you flatten the classpath completely when building the fat jar then you need to somehow be able to reconcile duplicate classpath entries. (Duplicate classpath entries are perfectly within spec as far as I can tell -- at least as long as they are from distinct jars. Not sure if they're allowed in a single jar.)
* If you use nested jars (+ a custom classloader, perhaps) then it becomes impossible to refer to classpath resources in nested jars in a standard way (via java.net.URL, that is); see the sketch below. Sometimes you can use foo.jar!bar.jar/blah, but even with that non-standard syntax I've encountered at least one case where it was impossible to refer to a doubly-nested classpath resource (don't ask). The lack of standard support for nesting jars seems like an oversight in the class loading/resource API, but there it is.
There's probably more, but that's at least a couple of the ones that come to mind.
For example, the plugin data file used by Log4j 2.x, where you have to have custom merge logic to handle that specific (binary!) file when building a fat jar. You might blame this on Log4j, but as a practical matter it's hard to avoid. This is a pretty rare scenario, but any custom build logic or special-case plugins can be a huge pain for maintenance.
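To illustrate the nested-jar resource problem mentioned above, a minimal sketch (the resource name is invented):

    import java.net.URL;

    public class ResourceLookup {
        public static void main(String[] args) {
            // On a flat classpath this prints a standard jar: URL, e.g.
            //   jar:file:/opt/app/lib/foo.jar!/config.properties
            // For a jar nested inside another jar there is no standard
            // URL form, which is why fat-jar tools fall back to custom
            // classloaders and ad-hoc syntaxes.
            URL url = ResourceLookup.class.getClassLoader()
                                          .getResource("config.properties");
            System.out.println(url);
        }
    }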
It does get converted to a class at runtime, but then it's long past the classloader anyway.