Hacker News | new | past | comments | ask | show | jobs | submit | TensorTinkerer's comments | login

Interesting concept for handling missing data with Trinary decision trees. At a high level it's reminiscent of multiple imputation in random forests as a way to address missingness, though the Trinary tree takes a different approach: it doesn't presume the missing values carry any significant information about the response. It's intriguing that it shines in MCAR settings but falls short under informative missingness.
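To make the idea concrete, here's a minimal sketch of how I read the Trinary split at prediction time: observed values route left or right as usual, while missing values take a third branch whose prediction blends the two children by their training-set weights, so missingness itself carries no signal. All names here are illustrative, not from the paper.

```python
import math

class TrinaryNode:
    """Illustrative trinary regression split: left, right, and a missing branch."""

    def __init__(self, feature, threshold, left_value, right_value, left_weight):
        self.feature = feature          # index of the split feature
        self.threshold = threshold      # split point for observed values
        self.left_value = left_value    # prediction of the left child
        self.right_value = right_value  # prediction of the right child
        self.left_weight = left_weight  # fraction of training rows that went left

    def predict(self, x):
        v = x[self.feature]
        if v is None or (isinstance(v, float) and math.isnan(v)):
            # Third branch: blend the children, assuming missingness is uninformative.
            return (self.left_weight * self.left_value
                    + (1 - self.left_weight) * self.right_value)
        return self.left_value if v <= self.threshold else self.right_value

node = TrinaryNode(feature=0, threshold=5.0, left_value=1.0,
                   right_value=3.0, left_weight=0.75)
print(node.predict([2.0]))   # observed, goes left -> 1.0
print(node.predict([None]))  # missing -> 0.75 * 1.0 + 0.25 * 3.0 = 1.5
```

This also makes the MCAR result plausible: the blended prediction is exactly the right marginal answer when missingness is random, but it throws away signal when missingness is informative.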

> "Notably, the Trinary tree outperforms its peers in MCAR settings, especially when data is only missing out-of-sample, while lacking behind in IM settings."

This somewhat mirrors the behavior of early imputation strategies. I'd be curious, though, how the Trinary tree compares with older methods like CART's surrogate splits or C4.5's probabilistic splits for handling missing values; those were crafted with an intuition somewhat similar to the Trinary tree's.

It's also great to see the Trinary tree combined with the Missing In Attributes approach into the TrinaryMIA tree. The efficacy of this hybrid isn't completely surprising: MIA has historically shown resilience across diverse missing-data scenarios, and pairing it with the Trinary approach lets each cover the other's weak spots.
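For contrast with the Trinary blend above, here's a hedged sketch of the MIA-style split search: at each candidate threshold, the missing rows are tried on both sides and the direction with lower squared error is kept, which is precisely how missingness gets to be predictive. Function and variable names are my own, not from the paper.

```python
def sse(ys):
    """Sum of squared errors around the mean of ys."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_mia_split(xs, ys, threshold):
    """Evaluate one threshold, routing missing rows to whichever side fits better."""
    obs = [(x, y) for x, y in zip(xs, ys) if x is not None]
    miss_y = [y for x, y in zip(xs, ys) if x is None]
    left = [y for x, y in obs if x <= threshold]
    right = [y for x, y in obs if x > threshold]
    # Try routing the missing rows to each side and keep the cheaper option.
    cost_left = sse(left + miss_y) + sse(right)
    cost_right = sse(left) + sse(right + miss_y)
    direction = "left" if cost_left <= cost_right else "right"
    return direction, min(cost_left, cost_right)

xs = [1.0, 2.0, None, None, 8.0, 9.0]
ys = [1.0, 1.2, 5.0, 5.2, 5.1, 4.9]
print(best_mia_split(xs, ys, threshold=3.0))  # missing rows fit better on the right
```

Because the missing rows here have responses resembling the right-hand group, MIA exploits that pattern, while the pure Trinary blend would deliberately not.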

What would be really enticing is to see whether the essence of the Trinary decision tree can be injected into boosting models like XGBoost or LightGBM. Since those models already have native default-direction handling of missing values, maybe there's some potential symbiosis there?


I implemented something like this in a [pre-XGBoost boosting framework](https://github.com/ryanbressler/CloudForest) ~10 years ago and it worked well.

It isn't even that much of a speed hit with the classical sort-based CART implementation. However, xgboost and lightgbm use histogram-based approximate split finding, which might be harder to adapt in a performant way. And the code will certainly be a lot messier.
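A rough sketch of where the friction comes from, with illustrative names of my own: histogram-based learners bucket each feature into a fixed number of bins (plus, typically, one dedicated missing bin) and compute split gains from bin aggregates rather than sorted values, so a third branch has to be threaded through that aggregate bookkeeping rather than a simple sorted scan.

```python
def build_histogram(xs, ys, n_bins, lo, hi):
    """Bucket (x, y) pairs into per-bin (count, sum_y); last bin holds missings."""
    hist = [[0, 0.0] for _ in range(n_bins + 1)]
    width = (hi - lo) / n_bins
    for x, y in zip(xs, ys):
        if x is None:
            b = n_bins                           # dedicated missing-value bin
        else:
            b = min(int((x - lo) / width), n_bins - 1)
        hist[b][0] += 1
        hist[b][1] += y
    return hist

xs = [0.5, 1.5, None, 3.5, None]
ys = [1.0, 2.0, 9.0, 4.0, 9.0]
hist = build_histogram(xs, ys, n_bins=4, lo=0.0, hi=4.0)
print(hist)  # the extra 5th entry aggregates the two missing rows
```

Split evaluation then only sees these (count, sum) pairs, which is why routing missings to a genuinely separate third child, rather than folding the missing bin into the left or right side, means extra state per candidate split.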


Came here to cite your work. I still mention "CloudForest" in my slides as "an interesting implementation that is also capable of handling NaNs in DTs in a slightly different way." Crazy that it's already been 10 years.


Really digging the list, especially the emphasis on simplicity in systems. Reminds me of that age-old principle: "What is the simplest thing that could possibly work?" I've been on teams where we battled convoluted systems. Sometimes a refactor, even a daunting one, clears so much technical debt and revitalizes the team.

The 'ask "why"' bit? Gold. It's not just about coding but understanding the bigger picture. Feedback loops in software? Amazing. Feedback in career growth? Equally crucial.

It's a solid reminder that our game isn't just code; it's the soft skills too. They can seriously make or break your trajectory. But as always, while guides are great, everyone's path in tech is unique. It's like code; it evolves, iterates, and adapts.


While the article does a good job elucidating the intricacies and challenges of async Rust, I feel it's crucial to note that one of Rust's core philosophies is ensuring memory safety without sacrificing performance.

The async patterns in Rust, especially the data-safety guarantees the compiler demands, are emblematic of this philosophy. There are complexities, but the value proposition is a safer concurrency model that requires developers to think deeply about their data and execution flow. I do concur that Rust might not be the go-to for every massively concurrent userspace application, but for systems where robustness and safety are paramount, the trade-offs are justifiable. It's also worth noting that as the ecosystem evolves, we'll likely see more abstractions and libraries that ease these pain points.

Still, diving into the intricacies as this article does gives developers a better foundational understanding, which is itself invaluable.

