
I love to write; actually, I think I have to write, as it's the only way I've figured out how to put guardrails on my thoughts.

I got my first reMarkable a few years ago and was super excited; I thought it could be the bridge between my need to write and the digital world.

I gave up. I also tried an iPad, but again I gave up.

I ended up using a cheap fountain pen and paper whose texture I like.

I think the problem with all these devices is that, from a product perspective, they focus on the wrong things.

I don’t care about colors and syncing with the cloud or whatever else.

I care about an experience that is as close as possible to natural writing, and that means the device's latency and the tactile feeling I get when I touch the screen with the pen are the most important aspects.

I haven't seen much happening there, and maybe these are just problems that are too hard to solve.

Or maybe I'm just a member of too niche a group of people.

But until I find a digital writing instrument that gives me the sensory feedback of pen and paper, I don't see myself going back to these devices.


I'm not sure why you are getting downvoted, probably because people feel you are advertising Modal.

But I have to say something about Modal. The difference with this vendor is that they try to reimagine the way people build on the cloud, and it's worth checking out just to see how different the developer experience could be.

I know that most people use it because of the easy and affordable access to GPUs, but I think we are missing the true innovation here, which is the developer experience.

I would even consider Modal a cloud infra product, although a vertical one, more than an ML or DE product.

*edited to fix some spelling*


Didn't realize it was downvoted, but fair enough if people feel it's too much of an ad. Comment is sitting at 2 points now :)

Glad you really get what we're trying to do with Modal. You're right it's not just an easy way to get serverless GPUs.

Modal is reimagining software development practices for the cloud era. Developing in the cloud should not be just writing YAML or HashiCorp Configuration Language templates, pushing/pulling Docker images, and re-running 'infratool up' over and over until things work.
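For anyone who hasn't tried it, the flow looks roughly like this (a rough sketch, treat the exact names as approximate):

    # Rough sketch of the Modal programming model; names are approximate.
    import modal

    app = modal.App("example-app")
    image = modal.Image.debian_slim().pip_install("numpy")

    @app.function(image=image, gpu="A10G")
    def square(x: int) -> int:
        # This body runs in a cloud container built from `image`.
        return x * x

    @app.local_entrypoint()
    def main():
        # `modal run example.py` builds the image, ships the code, and calls
        # the function remotely; no Dockerfiles, YAML, or `infratool up` loops.
        print(square.remote(7))

The environment, the infrastructure, and the code all live in that one Python file, which is the point.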


Hey, thanks for the feedback! There is a reason the article has the structure it does.

I'm going through the main pillars of MLOps and explaining how they overlap with data engineering.

Also, the title says "mostly," not "just," data engineering.

It might be misleading if you just go through the subheaders, but I'm sure that if you go through the content you will see that the title and the content are pretty aligned.


Hey folks, I'm the author of the post, and I'm happy to see it getting so much attention on HN. Thank you for the incredible comments!

I want to clarify something about my intention with this post. There is a reason I chose "mostly" in the title. I'm not dismissing the different needs of ML.

If a category withstands the tests of the market, then there's good reason for it to exist.

But we have ended up creating silos within orgs with fundamentally aligned goals because of the way we build products and companies around them.

What I'm advocating for in this article is the need to think more holistically when we design and build data infra tooling. Yes, ML has unique challenges, but these challenges won't be addressed by reinventing everything again and again.

Tooling should be built with all the practitioners involved in the lifecycle of data in mind.

It's harder to do, but at least we'll stop wasting our time building one Airflow copy after another that is doomed to fail.


Here's one thing that you missed - transformations. Data engineers hear the word transformations and think they know what it means - aggregations, binning, data reductions, cleansing, etc. Data scientists, however, think of transformations as preparing features for models - encoding categorical variables (one-hot-encoding, LabelEncoding, OrdinalEncoding) and normalizing/standardizing/log-transforming numerical features.

These two sets of transformations are quite different: some are "model-independent" - you can do the transformation and reuse the output feature across many models (aggregations, binning, feature-crosses, embeddings, etc.) - while the data scientists' transformations are model-dependent and not reusable across different models (e.g., an XGBoost model doesn't want normalized numerical features, but a DNN typically needs normalized or standardized numerical features).

These differences are reflected in how we build our "ML pipelines" - we split them into feature/training/inference pipelines, enabling us to localize model-independent transformations to feature pipelines while doing model-dependent transformations in training/inference pipelines (ensuring no skew).
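A minimal sketch of that split, with made-up column names and scikit-learn purely for illustration (not tied to any particular platform):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Feature pipeline: model-independent transformations, reusable across models.
    def feature_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
        # Aggregations like these can be computed once and shared by many models.
        return raw.groupby("user_id").agg(
            purchase_count=("order_id", "count"),
            avg_amount=("amount", "mean"),
        ).reset_index()

    # Training/inference pipelines: model-dependent transformations.
    def prepare_for_dnn(features: pd.DataFrame):
        # DNNs typically want standardized numerical inputs; the fitted scaler
        # has to be reused at inference time to avoid skew.
        scaler = StandardScaler()
        X = scaler.fit_transform(features[["purchase_count", "avg_amount"]])
        return X, scaler

    def prepare_for_xgboost(features: pd.DataFrame):
        # Tree-based models don't need scaling; the shared features are used as-is.
        return features[["purchase_count", "avg_amount"]].to_numpy()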

In summary, the devil's in the details in pipelines for ML. I agree, though, that orchestrators like Airflow are good enough for orchestration. However, the ML assets (mutable & reusable features, immutable models, immutable training/inference datasets) are different, and tooling will ultimately reflect those differences.


Trino can be fault tolerant, but you have to explicitly enable fault-tolerant execution.

It might be worth running your benchmarks against Trino with fault-tolerant execution mode enabled. Check the documentation here: https://trino.io/docs/current/admin/fault-tolerant-execution...
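Roughly, it comes down to a couple of config properties; the exact names are in the docs above, but from memory it looks something like this (the S3 bucket is just a placeholder):

    # etc/config.properties on the coordinator and workers
    retry-policy=TASK

    # etc/exchange-manager.properties; task-level retries need spooling storage
    exchange-manager.name=filesystem
    exchange.base-directories=s3://example-spooling-bucket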

Adding fault-tolerant execution to Trino was a big and complicated project; for anyone interested in more details, check here: https://trino.io/blog/2022/05/05/tardigrade-launch.html


I have. I am about 2x faster than Trino with fault tolerance. But I didn't put the numbers on that plot because this Trino feature is still really new and I might not be benchmarking it in the best way.


That's awesome. For anything Trino or Project Tardigrade related, reach out to any of the maintainers; they'll be happy to help.


Hey ck_one, that's a hard question to answer without getting into "benchmarketing" territory.

My suggestion is to try both under your own workloads and see the difference. Trino is also used by products like Athena (AWS) and Galaxy (Starburst) so if you want to play around and see how Trino performs without spending too much time on setting up clusters on your own, you can try these great products.

Having said that, I'd like to add that building a performant distributed query engine is just hard. Trino has been in development for ten years and is used by major companies in very demanding environments; those environments are where the technology has been shaped into what it is today, and they are proof of its performance and stability.

(edited to add an important disclaimer that I work at Starburst)


And it's so much fun to go through. Most CS books are very dry, but this one is such a fun read. I would encourage everyone to at least read the web version of the book; I'm sure you will end up buying it in the end.


Nand2Tetris is another fun one IMO; you can do a chapter a week (an hour of reading and a couple of implementation sessions) and be done in a few months.


This is awesome! I wasn't aware of it. Now that I'm approaching midlife-crisis age, I'll give it a try as a way to feel young again and remember my college days. Thank you!


Hey guys, we are excited to announce the release of Project Tardigrade for Trino (previously known as PrestoSQL).

Project Tardigrade is the first step towards re-architecting the query engine, and one of the immediate benefits will be support for more workloads.

Please check out the release, give it a try, and most importantly, give us feedback!


This is insane. Both my mother and grandmother died of myelofibrosis. My mother suffered from related diseases for almost 15 years before the symptoms were diagnosed as myelofibrosis, and after that doctors were still debating whether the disease had been there from the beginning or not.


I updated the title and added the date. Thanks for pointing this out!

I was mainly driven by the experiment, and that's what I wanted to share here, but it's better to add the right metadata.


No worries! I assumed good faith. :)

