How We Found a Missing Scala Class (heapanalytics.com)
85 points by drob on Sept 12, 2018 | 43 comments

FYI this domain is blocked by default for uBlock users.

Archive.org is getting a 403 forbidden nginx.

Archive.is worked: http://archive.is/kKcRV

Ya, I see. And Google and the Wayback Machine have not yet cached the content at the submitted URL.

Wild-assed kazinator guess: it's probably a substring/regex match on "analytics". Must be a tracking domain!

It's on the Pete Lowe Adserver list directly, since Heap is in fact a tracking system and domain: https://pgl.yoyo.org/as/serverlist.php?showintro=0;hostforma...

While Heap Analytics indeed is tracking software, clicking the "temporarily unblock" button allowed me to read a pretty good detective story.

Heap CTO here – would love to answer any questions you have.

This was my first exposure to btrace, which is a super useful Swiss army knife for JVM debugging. That made this a worthwhile adventure for sure.

Do you think Scala already surpassed C++'s ability to obfuscate code, or does it need more improvements to get there?

Great article. I had been hunting down a similar issue in Java 8 with maven dependencies. I was getting the same error, but confirmed that the dependent jars were correctly included on the classpath. I eventually gave up and decided to take a different programming approach that did not include these missing classes, but I think I will revisit it to see if a class loader is getting closed somehow.

Ooh, check it out and let me know what you find! It would make me really happy if this post helped someone debug something when they had previously hit a dead end.


I'm curious how you guys operate the Flink cluster. Do you have a single huge shared Flink cluster where people can submit any kind of job for various applications/streams, or do you have multiple smaller Flink clusters for specific use cases? Are they on Mesos/YARN/k8s, or just plain VMs/bare metal?

Which Flink version are you guys on? I'm very excited about Flink 1.6, especially in relation to FLIP-6 [https://cwiki.apache.org/confluence/pages/viewpage.action?pa...]

Just wondering if you guys have any thoughts on that.

Thank you.

We're running a single flink cluster. Engineers can run whatever jobs they need, but we aren't writing new flink jobs that often so the load is pretty predictable. We have a single digit number of jobs at the moment.

We are on 1.3.2 at the moment, running on EC2 vms.

How much time did your team spend on the debugging effort?

Iirc Ivan (post author) spent a few days tracking this down. There were some other debugging dead ends that we omitted in this writeup. One red herring was that the issue appeared to happen during the US morning, so there was some time-of-day component, and we thought it might be a system load issue.

The fix turned out to be fairly involved too – on the order of a week I think.

Ivan works from Bulgaria so sadly he is asleep right now.

Isn't that a bit odd, your stack trace there in the article?

Usually the stack trace when a NoClassDefFoundError is thrown contains a "Cause" clearly showing the name of the class loader that was supposed to know about the class in question, and which then rather obviously failed to load it. If it actually isn't in the real trace, the exception/error logging is probably a bit wonky.

I have seen logging procedures that don't traverse/print the entire traceback chain, but only print the message and the immediate stack. This is unfortunately a rather terrible idea. Quite often it will exclude the actual cause from the printed trace while simultaneously retaining the error message, which not rarely leads to liberal amounts of confusion. The pattern of exceptions being thrown with a cause is not uncommon in the JDK, so making sure to log causes is important.
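To make the point about logging causes concrete, here is a minimal sketch of walking a throwable's full cause chain. The class name `CauseChain`, the helper `describe`, and the example class name `com.example.Missing` are hypothetical and used only for illustration; in practice `Throwable.printStackTrace()` or a logger call that receives the throwable itself already prints the "Caused by:" sections.

```java
import java.util.ArrayList;
import java.util.List;

class CauseChain {
    /** Returns one line per throwable in the chain, outermost first. */
    public static List<String> describe(Throwable top) {
        List<String> lines = new ArrayList<>();
        int depth = 0;
        // The depth limit is a guard against a cyclic cause chain.
        for (Throwable t = top; t != null && depth < 32; t = t.getCause(), depth++) {
            String prefix = (depth == 0) ? "" : "Caused by: ";
            lines.add(prefix + t.getClass().getName() + ": " + t.getMessage());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical class name, used only to build an example chain.
        NoClassDefFoundError error = new NoClassDefFoundError("com/example/Missing");
        error.initCause(new ClassNotFoundException("com.example.Missing"));
        describe(error).forEach(System.out::println);
    }
}
```

A logger that prints only the first line of this list keeps the message and drops the cause, which is exactly the failure mode described above.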

In general, although it's probably obvious, I would like to mention that having code loaded by class loaders with different lifetimes interact is rife with "interesting" issues. Wherever possible I would recommend serializing messages over any boundaries where class loaders have different lifetimes, as it both prevents all of the strangest causes of errors and can also lead to a cleaner design. An exception would be if doing so is prohibitively expensive from a performance perspective, of course.
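One way to read the serialization advice is sketched below: only bytes and JDK types cross the boundary, so neither side holds a reference to a class defined by the other side's class loader. The class `BoundaryMessage` and the fields `userId` and `timestampMillis` are assumptions made up for this example; they are not taken from the article.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.AbstractMap;
import java.util.Map;

class BoundaryMessage {
    /** Sender side: flatten the message into bytes before it crosses the boundary. */
    public static byte[] encode(String userId, long timestampMillis) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(userId);
            out.writeLong(timestampMillis);
        }
        return bytes.toByteArray();
    }

    /** Receiver side: rebuild the message using only JDK types. */
    public static Map.Entry<String, Long> decode(byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            String userId = in.readUTF();
            long timestampMillis = in.readLong();
            return new AbstractMap.SimpleImmutableEntry<>(userId, timestampMillis);
        }
    }

    public static void main(String[] args) throws IOException {
        Map.Entry<String, Long> message = decode(encode("user-1", 1536710400000L));
        System.out.println(message.getKey() + " @ " + message.getValue());
    }
}
```

The cost is the extra encode/decode step on every message, which is the performance trade-off the comment mentions.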

NoClassDefFoundError? But it’s right there!

Although in this case the cause was very different, it reminds me of an old "trap for young players" with loading shared libraries dynamically --- the library itself can exist and be readable and executable, and yet attempting to load it fails with a "file not found" error. This happens when one of its dependencies, directly or indirectly, is missing.

I ran across another similar-yet-very-different example of this once in a completely different language.

In my case, we added a new class, it worked fine on dev, then failed on staging (and would have failed in production if we'd let it go that far). This was confusing, because we were using Vagrant to ensure our dev and staging environments were identical. What could be going on?

Well, our linux VMs were being hosted in OS X, with shared folders for the code, and by default OS X volumes are not case sensitive. Meanwhile the linux staging and production environments were using actual linux filesystems, which were case sensitive. So someone added a new class MyFancyClass, then tried to import it as MyFancyclass, and it worked great in dev since the underlying FS of the host OS could find the file, then failed on staging.

A fun debugging ride, and a good reminder that 1) having dev and staging the same is really important and 2) that might be harder than you think. :)

> a good reminder that 1) having dev and staging the same is really important

Actually, I think it proves that having staging and PROD the same is particularly important.

Oh yes, that too!

I recall, back in the mid-to-late 2000s, investigating an OutOfMemoryError on a Windows server, only to find that the cause was an inability to create a thread, because the kernel buffer space for managing threads was exhausted (many thousands of threads had been created).


I've had this one a few times recently, and every time, it stumps me for a few minutes!

Yes, I know the details, no I never remember them when it happens to me ;)

Ah! This was one of my very, very least favorite things about developing win32 DLLs, way back in the day.

It sounds like you have a lot of operational issues due to the technologies that you used. I mean, at least you aren't doing your backend in node, but running an actor system on top of an actor system is going to be brutal to properly analyze once you actually have scale.

What sort of process do you have for picking trendy technologies vs tested ones, and how much do you talk to people who have built large scale systems before implementing things like scala?

Yeah... Reading this, it smacked of a possible combination of poor tool choice and over-engineering (which I've been guilty of plenty). I built a video processing/workflow application in Scala with Akka a few years ago and debugging that was hard enough, eventually it was refactored to a simpler Kotlin/Spring application... Actor systems are great for certain use cases but you can really hurt the transparency of your app if you aren't careful. I can't imagine maintaining the OP's application at scale for this use case, but maybe they have someone smarter than me!

Counterpoint: debugging erlang systems in production is a cakewalk. The tracing and introspection tools that come bundled in OTP make tracking problems down really easy. It's really hard to go back to systems that don't have erlang level visibility, so much so that it's kind of a crutch sometimes.

This is an ecosystem problem and not something inherent in a program using an actor abstraction.

Ah yes, that is a great point. Erlang/OTP were designed to be used as actor systems, whereas the actor implementations in Scala and other JVM languages/frameworks are at least one level of abstraction above that. Definitely agreed on Erlang/OTP having a wonderful set of tools for debugging/visibility, but I still stand by my assessment that OP's problems are from over-engineering (and secondarily from the ecosystem).

Definitely. In the handful of instances I've seen akka used, regular jvm abstractions would have been simpler and more coherent.

Systems in general tend to have a way that they like to be used. Erlang has all of the tools to support that model, and from my limited experience with it, works great when you respect it. I wouldn't reach for it due to my inexperience, but if I had someone like you on my team, I'd love to learn.

Java also has really great tooling.. when it is used like it wants to be. I don't have enough erlang experience to know if it is true there or not, but with Java, I've found that smart but green devs (like it sounds like Heap has) tend to reach for systems that solve a lot of their initial problems and cover up their initial ignorance, but then trade it for operational problems down the road. More experienced developers tend to build systems that are boring, explicit, take longer to get to Beta, but don't require a pager.

drob mentioned that they aren't writing flink jobs very often, which makes me think that they are probably using it for some sort of rollups / stream processing of their analytics data. If the business logic of those is complex, they'll probably have a bad time with outliers that fuck up their cardinality assumptions. If it isn't super complex, they probably didn't take the time to model their data correctly with boring java classes, and keep pushing complexity into the interactions between their actors.

Either way, they would have a much faster and easier-to-maintain system if they forced themselves to pretend to be stuck with Java 1.6, except where 1.8 features improve performance and readability (basically, avoid abusing streams and reflection).

Scala is not exactly a new trendy technology at this point; it's being used in industry quite widely.

Why doesn't java just spit out a classLoaderClosed error?

Yes, I believe that was the bad part of the JVM implementation!

I keep running into this type of problem all the time with our developers. Please keep it simple. Take a step back and ask yourself: do I need all this stuff, and is this the best approach? Often they just blindly accept all the external libs. As an old-school guy, I don't trust all those dependencies at all.

How do you lose a class in a programming language?!

The title is misleading. The class was there all along, the NoClassDefFoundError was thrown because the class loader was closed when trying to load a class.
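The underlying JDK behavior can be sketched with a plain `URLClassLoader`: once `close()` has been called, the loader no longer serves new lookups even though the files are still on disk. This is not Flink's code, and it uses a resource lookup instead of a class load to stay self-contained; the class `ClosedLoaderDemo` and the file name `marker.txt` are hypothetical.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ClosedLoaderDemo {
    /** Returns { found before close, found after close, file still on disk }. */
    public static boolean[] lookupAroundClose() throws IOException {
        // Create a throwaway directory that acts as a one-entry classpath.
        Path dir = Files.createTempDirectory("closed-loader-demo");
        Path file = dir.resolve("marker.txt");
        Files.write(file, "still on disk".getBytes(StandardCharsets.UTF_8));

        URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() });
        boolean before = loader.getResource("marker.txt") != null;

        // After close(), the loader refuses to look up anything it has not already loaded.
        loader.close();
        boolean after = loader.getResource("marker.txt") != null;

        boolean onDisk = Files.exists(file);
        Files.delete(file);
        Files.delete(dir);
        return new boolean[] { before, after, onDisk };
    }

    public static void main(String[] args) throws IOException {
        boolean[] result = lookupAroundClose();
        System.out.println("found before close: " + result[0]);
        System.out.println("found after close:  " + result[1]);
        System.out.println("file still on disk: " + result[2]);
    }
}
```

The same mechanism applies to classes that have not been loaded yet, which is how a class can be "right there" in the jar and still fail to load.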

But why can you even do that?

It says right there in the article.

The moment the article mentioned "Fat jar" I knew that'd be the problem.

I don't recommend using any type of fat jar plugin (like OneJar) or even Google Guice for that matter. Custom class loaders are a nightmare.

Thanks to Docker containers, you should never really need a fat jar again. Just find a decent Docker packager for your build system (sbt, gradle, etc.) and it can plop all your dependencies in there in a nice, isolated container that uses the standard class loader.

Fat jars are evil, but there's no need for Docker: Just collect the dependency jars in a lib/ (or whatever) folder and explicitly give them on the classpath when running the 'java' executable. We use the sbt 'pack' plugin where I work and it works a treat.

(Docker has its place, but it's massive overkill just to avoid fat jars.)

I was thinking of doing this. We have a Scala app that we deploy as a fat jar in a docker container. But if we instead shipped the individual files, the diffs between docker builds would be much smaller. Still, the time to build and deploy the docker container is so small it hasn't been worth the effort yet.

What is problematic about Fat jars?

The problem seemed to be Flink's implementation unloading the fat jar's classloader when erroring. This would have happened with slim jars as well, wouldn't it?

I also don't see how Docker relates exactly; you can have hundreds of library jars on a classpath with standard classloaders, no Docker required.

I'm not sure whether this particular problem has anything to do with fat jars, but there are a couple of really big annoyances with fat jars which have bitten me occasionally:

* If you flatten the classpath completely when building the fat jar then you need to somehow be able to reconcile duplicate classpath entries[1]. (Duplicate classpath entries are perfectly within spec as far as I can tell -- at least as long as they are from distinct jars. Not sure if they're allowed in a single jar.)

* If you use nested jars (+ custom classloader perhaps) then it becomes impossible to refer to classpath entry resources in nested jars in a standard way (via java.net.URL, that is). Sometimes you can use foo.jar!bar.jar/blah, but even with that non-standard syntax I've encountered at least one case where it was impossible to refer to a doubly-nested classpath resource (don't ask). The lack of standard support for nesting jars seems like an oversight in the class loading/resource API, but there it is.

There's probably more, but that's at least a couple of the ones that come to mind.

[1] For example the plugin data file used by log4j 2.x where you have to have custom merge logic to handle that specific (binary!) file. You might blame this on log4j, but as a practical matter it's hard to avoid it. This is a pretty rare scenario, but any custom build logic or special-case plugins can be a huge pain for maintenance.

Also Scala's lambdas create new anonymous classes, but Java's lambdas are kind of bootstrapped static methods in the existing class file.

It does get converted to a class at runtime, but then it's long past the classloader anyway.
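A small reflection sketch can show the difference on the Java side. It assumes the code is compiled with javac, whose current (unspecified) naming scheme gives lambda bodies names starting with `lambda$`; the class `LambdaShape` and its members are hypothetical.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

class LambdaShape {
    // Compiled to a method inside LambdaShape.class plus an invokedynamic call site.
    public static final Runnable LAMBDA = () -> { };

    // Compiled to its own class file (LambdaShape$1.class).
    public static final Runnable ANONYMOUS = new Runnable() {
        @Override
        public void run() { }
    };

    /** Names of lambda body methods declared on this class (javac naming assumed). */
    public static List<String> lambdaBodyMethods() {
        List<String> names = new ArrayList<>();
        for (Method m : LambdaShape.class.getDeclaredMethods()) {
            if (m.getName().startsWith("lambda$")) {
                names.add(m.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("lambda body methods: " + lambdaBodyMethods());
        System.out.println("anonymous class:     " + ANONYMOUS.getClass().getName());
        System.out.println("lambda runtime class: " + LAMBDA.getClass().getName());
    }
}
```

The anonymous class has to be found by a class loader at first use, while the lambda's runtime class is generated in memory when the call site is first linked.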

Scala 2.12 lambdas compile to Java 8 lambdas.
