Koheesio: Nike's Python-based framework to build advanced data-pipelines (github.com/nike-inc)
217 points by betacar 10 months ago | hide | past | favorite | 73 comments



I'd like to understand what data engineering inside Nike is actually like. I'm curious because I have relevant experience on my LinkedIn profile, and I get reached out to almost weekly by third-party recruiters trying to fill really low-paying contract data engineering and ML jobs at Nike. These roles seem to target people with professional experience in the US but pay roughly a third of what I would consider the going rate. There's another top-level comment here suggesting this tool might make sense "in a shop with a lot of inexperienced devs", which would confirm my anecdata. Maybe the roles are actually scams, who knows :shrug:


Nike's data engineering is very bad. It's hundreds of temporary contractors, mostly offshore, all with 6-18 month tenures, and everyone reinvents their own square wheel. Thousands upon thousands of abandoned Confluence pages of documentation. The most convoluted SQL and data architecture you'll ever find. Getting answers to simple questions like "How many shoes did we sell in-store vs ecommerce last week?" is a nearly impossible task.


> How many shoes did we sell in-store vs ecommerce last week?

That is perhaps not a great example. My brother is a business analyst at Nike (has been for 15 years or more). I just asked him how hard it would be to answer that question and he said it would be pretty easy. Granted, this is the kind of data he works with routinely, so it may be more difficult for other teams that do not.


Is he the person running the query or is he reading a blood spattered report pulled from the dead hands of some data engineer who perished battling the system to retrieve said data?



A little of both? He's a team lead. Definitely runs queries himself, but probably spends more time interfacing with management and other teams on a day-to-day basis.


> Getting answers to simple questions like "How many shoes did we sell in-store vs ecommerce last week?" is a nearly impossible task.

I find this type of thing scary as an outsider looking in. How a company so large has such immature engineering continues to astonish me.


> I find this type of thing scary as an outsider looking in. How a company so large has such immature engineering continues to astonish me.

It's management that doesn't want to risk their positions by doing the very difficult business of either starting over or properly simplifying their stack. It's not easy, it's not quick, but if they can't even answer that basic question then they need to do the work.


To defend management a little bit, these massive companies have existed through many eras of technology with many different managers. They work with many external companies in many different ways. They have an exceptionally complex, but functioning tech stack, that allows all of these many dependencies to function together. Lastly, they are successful as they are!

It's not usually an issue of immaturity, it's just really hard. To make things worse, often people don't really want to do the work because literally any other data engineering job would probably be more enjoyable.

Simplifying the tech stack would probably require simplifying their business operations, which probably means less revenue.

Starting over is often literally not possible because there are so many interconnected systems that aren't all necessarily owned by the company trying to make the decision...


Nike's success in the marketplace has approximately zero to do with their tech stack. That's why their tech literacy is so bad. It just doesn't matter to their business.


This is the most succinct point in defence of management really. Tech stack has no impact on business results, so quite rightly they don't make it a priority to 'fix' it.


> Simplifying the tech stack would probably require simplifying their business operations, which probably means less revenue.

I don't understand this perspective. Simplifying the tech stack might mean taking multiple services in multiple languages, and deprecating some in favor of migrating that functionality to the most maintainable codebase. This shouldn't mean "simplifying their business operations", or affecting their business operations in any way.


I agree there probably are areas where they can simplify with no impact to operations.

But I would imagine there are a lot of pieces of apparent cruft hanging around that are actually there because if you remove them, things break.

Maybe a large retailer that you rely on requires an integration with an old version of SAP, and then a logistics partner only provides files over FTP, and you need to use OCR to retrieve any data from the files they're sending.

Management can't just mandate that you will 'simplify the tech stack'. Even refactoring smaller parts of the tech stack is often a pretty massive job.


"Look, my setup works for me. Just add an option to reenable spacebar heating."[0]

> Management can't just mandate that you will 'simplify the tech stack'.

In my experience, it's the devs who want to do refactors for love of the craft and the satisfaction of a job "well" done, while management seeks to sweep as much technical "debt" as possible under the rug, and they'll burn through as many Romanians' nights of sleep as it takes to reach their numbers.

0. https://m.xkcd.com/1172/


You'd be surprised... I am not...


My experience with these kinds of tools (and I've built some myself) tells me that you're better off hiring people who know what they are doing or having enough experienced people PLUS culture to train up juniors.

The idea that you'll just build a tool that makes hiring 10x as many inexperienced devs work is dubious. Just one more new DSL bro. Certainly we have cracked the code that no one else has.

The problem with these types of orgs/tools is that by its nature your DSL constrains the juniors/inexperienced devs to what is currently possible. There is not a lot of learning unless you rotate them through the tools team periodically, which no one does. It's also awful for the devs who are building in experience in something they can't use anywhere else.

I was in one shop where the "tools team" guy's epiphany was that he would meta-recruit by poaching the best data engineers out of the ETL team, lol. Very explicit "good team / bad team" vibes.


Speaking only from my own experience, my contract was ~2x what competitors were paying in the US. This was similar amongst the contractors I worked with, depending on seniority.


If I had to guess, a tool like this might be useful in a shop with a lot of inexperienced devs. It's a thin wrapper to make sure everyone walks the same well worn path the same way. You have 2-3 devs work on the tooling, and a much larger team doing rote ETL.

I worked at a shop that did this and the trade-off is TTM, as your 2 person tools team is constantly needing to unblock ETL team with new features as they encounter new requirements in the wild.

If your ETL team is 20+ people and the tools team doesn't have a head start, tools team will quickly fall behind an insurmountable backlog as your ETL team spins its wheels. But you might save some money if you choose the right KPI..


I think this is the case: when you run your pipelines at scale you want to standardize and simplify some repeatable aspects to lower the cost of managing them. You may also want to be orthogonal to orchestrator engines (or triggering engines) and avoid getting too opinionated and inflexible in the future. So this framework is exploring some sweet spot between raw spark pipelines and low code etl engines.


yeah though a lot of these fall for a variant of the "universal standard" conceit joked about in xkcd. All these low-code solutions suck, so we'll build our own in-house that surely won't have the same pitfalls..


I built a data processing framework at GE that let junior devs write whatever code they needed to transform a particular input. It provided an interface that they had to satisfy (for data lineage metrics) but otherwise scaled their code without them having to understand the distributed architecture or anything about the platform. Exceptions flowed up to the platform and became part of the data lineage metrics.
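The pattern described above can be sketched in miniature. This is not the GE framework's actual code, just an illustrative guess at the shape: devs implement a small interface, and the platform wraps execution so exceptions surface as lineage metrics rather than crashing the job. All class and field names here are invented.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineageRecord:
    """One lineage entry per step execution, error included."""
    step: str
    rows_in: int
    rows_out: int
    error: Optional[str] = None


class Transform(ABC):
    """The interface a junior dev's step must satisfy; the platform does the rest."""
    @abstractmethod
    def apply(self, records: list) -> list: ...


def run_step(step: Transform, records: list, lineage: list) -> list:
    # The platform wraps every step: an exception doesn't kill the pipeline,
    # it flows up and becomes part of the data lineage metrics.
    try:
        out = step.apply(records)
        lineage.append(LineageRecord(type(step).__name__, len(records), len(out)))
        return out
    except Exception as exc:
        lineage.append(LineageRecord(type(step).__name__, len(records), 0, error=str(exc)))
        return []
```

The devs only ever write `apply`; scaling, recovery, and metrics live outside their code.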

I walked into 20 years of adhoc code that had zero data lineage, recoverability, or scalability that was breaking daily. There were contractors with over a decade of tenure whose job it was to troubleshoot and fix their own brittle processes (and make new ones) daily.

I got laid off (737 MAX plus pandemic) as I was rolling it out, and they went back to the old way.

Subsequently, a new startup (Pantomath) emerged with former GE engineers (and other former colleagues of mine) from my former department to address that problem domain.

Based on my experience trying to socialize this type of solution, sales are going to be a bitch.


A framework with composable building blocks, allowing devs to unblock themselves by adding the functionality they need is a good solution.


What is TTM? Talk to me, or trailing twelve months?


I think Time to Market.


Many data engineering problems are impeded by strong typing, particularly type-transduction applications (translating between a database type system and a transport such as Avro, for example). In many cases that is somebody else's problem, solved in a library; when it isn't, the strengths and facility of a dynamic language can save you considerable code complexity and maintenance. Type control is often central to reporting as well, and it is, again, more awkward in a strongly typed context. I would tend to argue that insistence upon type frameworks such as pydantic in a data engineering framework is naive and imposed by academic rather than industry experience. There is a reason that Python is chosen for data processing applications, and it certainly isn't typing.
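To make the type-transduction point concrete, here is a hedged sketch of mapping database column types onto an Avro record schema in plain dynamic Python. The type names and nullability rules are illustrative assumptions, not any real library's behavior; in particular, real code would map `numeric` to an Avro decimal logical type rather than `double`.

```python
# Hypothetical DB-type -> Avro-type mapping (illustrative only).
DB_TO_AVRO = {
    "varchar": "string",
    "integer": "int",
    "bigint": "long",
    "numeric": "double",  # lossy simplification; see lead-in
    "boolean": "boolean",
}


def to_avro_schema(table: str, columns: list) -> dict:
    """Build an Avro record schema from (name, db_type, nullable) triples."""
    fields = []
    for name, db_type, nullable in columns:
        avro_type = DB_TO_AVRO[db_type]
        # Avro expresses nullability as a union with "null".
        fields.append({"name": name, "type": ["null", avro_type] if nullable else avro_type})
    return {"type": "record", "name": table, "fields": fields}
```

The whole thing is a dictionary and a loop; a statically typed version of the same translation tends to need a type class or visitor per pairing of systems.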


Python is chosen for data processing because it's one of the most popular languages in the world and it has a passable REPL (which is more than most languages) so you can use it for experimentation.

From the readme they describe it as "robust" and having a high level of type safety, so I'm guessing they're just leaning further towards the "learn about bugs before they hit production" end of the spectrum than you.

Then again I don't do any of this data engineering stuff so maybe it doesn't matter too much if it doesn't work reliably?
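"Learn about bugs before they hit production" in practice usually means validating job configuration at construction time, the way pydantic models do. A dependency-free sketch of that idea, with made-up field names and rules:

```python
from dataclasses import dataclass


@dataclass
class SourceConfig:
    """Hypothetical job config that fails fast on bad values."""
    path: str
    format: str
    batch_size: int

    def __post_init__(self):
        # Fail at construction time, long before the job reaches the cluster.
        if not isinstance(self.batch_size, int) or self.batch_size <= 0:
            raise ValueError(f"batch_size must be a positive int, got {self.batch_size!r}")
        if self.format not in {"parquet", "csv", "json"}:
            raise ValueError(f"unsupported format: {self.format!r}")
```

A typo'd `batch_size` blows up in a unit test instead of halfway through a cluster run.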


You're right, but this problem doesn't go away when you use Python - because you still have to interact with the type systems of the platforms you are integrating. It's a much more difficult and nuanced problem than I think most people realize.


Giving this some more thought: I do know that Nike has a revolving door for developers. It seems that a framework like Koheesio allows Nike to essentially hire for Scala from a pool of candidates that only have Python experience. Once hired, as they use PySpark and Koheesio daily, they don't even know they are Scala developers. Much easier to hire/fire Python developers these days.


"the report just has to look right"


I used to work a little with ETLs, Spark, Storm, etc., and I honestly don't understand the value proposition of this library. I'm no data engineering expert by any means (it was like two years working on data eng stuff about 30% of the time, 5+ years ago), but I expected that at least I'd get what this is useful for.


From their docs:

> Koheesio is a Python library that simplifies the development of data engineering pipelines. It provides a structured way

I think this pretty much sums it up, "a structured way".

It looks to be a thin wrapper around Spark to provide a consistent way to structure ETL jobs. They've implemented a mini DSL, defining jobs as a data structure on top of Spark.

I've seen several companies build stuff similar to this internally, defining jobs as a data structure. It all amounts to each company having their own internal conventions, their own view of what is easier for their devs and creating a framework for it. Nike have just decided to make theirs public.

You can do all of this with plain Spark scripts; personally, that's what I'd use. Big companies with lots of people love making these tools because companies love conventions (their conventions, their style guides) and deal with staff churn/onboarding frequently, so they believe these kinds of things make that easier.

Probably makes sense in Nike as a way of organizing their ETL jobs but that looks to be all it brings. A way to structure/define your simple spark jobs the way Nike devs believe it should be done.
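The "jobs as a data structure" pattern the comment describes can be sketched without Spark at all. This is not Koheesio's actual API, just an illustration of the shape: each step is a small class with one method, and a "task" is nothing but an ordered list of steps.

```python
class Step:
    """Base class every pipeline step implements (hypothetical mini-DSL)."""
    def execute(self, data):
        raise NotImplementedError


class Filter(Step):
    def __init__(self, predicate):
        self.predicate = predicate

    def execute(self, data):
        return [row for row in data if self.predicate(row)]


class Rename(Step):
    def __init__(self, mapping):
        self.mapping = mapping

    def execute(self, data):
        return [{self.mapping.get(k, k): v for k, v in row.items()} for row in data]


def run_task(steps, data):
    # The "orchestration" is just folding the data through the steps in order.
    for step in steps:
        data = step.execute(data)
    return data
```

The value (such as it is) comes from everyone writing `Filter`s and `Rename`s the same way, so any dev can read any job; the cost is that anything the DSL doesn't cover needs a new step class first.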


it looks like a layer of sugar over PySpark

seems to be Spark-only AFAICT?


This, at a glance, seems pretty simplistic. A neat project, but not something I would have expected on HN front page.


After decades of working on overly abstracted clever applications the only place I see elegance is in simplicity. I’d like to see more libraries like this on the front page.


A probably better explanation of what it is and why you might want to use it (or not) can be found here:

https://engineering.nike.com/koheesio/latest/tutorials/onboa...


A few weeks ago, I chose to write my data pipelines using Apache Beam. It seems that Koheesio shares some features with this project, but I believe Apache Beam is superior due to its ability to run on various runners, support multiple programming languages, and integrate with numerous data sources and destinations.


If I wanted an abstraction layer for writing pipelines today I'd look into ZenML: https://www.zenml.io/vs/zenml-vs-orchestrators


Oh so like luigi? Great!


Check out CloudQuery - Arrow powered ELT framework (Author here :) )


Yes, we loved it when you pulled the rug and made everything "premium".

I'm not against proprietary software, but your website still advertises this product as an open-source ELT.


Neato, and I personally appreciate your finops.sql example query :)


> Koheesio is not in competition with other libraries.

Yes, it is, because nobody wants to run multiple orchestrators, and the "What sets Koheesio apart from other libraries?" section does little to help users decide why they should pick yours.

Workflow orchestration is a mature category, as evidenced by the length of this list: https://github.com/meirwah/awesome-workflow-engines

I would expect someone who's seriously writing a new orchestrator in 2024 to cite the alternatives, their shortcomings, and how you intend to address them. Bonus points if you make a neat little table.

The fact that you're leading with Python does not inspire confidence. Pretty much all workflow orchestrators use Python for their glue, and that's hardly the interesting part.

What were they using at Nike before this?


This is the kind of attitude that makes people, companies, and researchers hesitant to publish code online. It's free code for everyone to see; they don't owe you anything. It's not necessarily a "product" for your consumption, it's just a repo.


It's a fair set of questions, posed fairly directly, without any sugarcoating. If you read it as a ruthless critique, it stings. If you read it as constructive feedback, it can bring the author a lot of value.

If the author can clearly show the value proposition of their library, it will get more adoption, and the community and the author will get value.

If the author realizes they coded something that is already a well-solved problem, or a poorly-constructed alternative, the author and Nike could gain by throwing this away and going with a better alternative.

Personally, I wasted too many hours of my life creating a solution that did not solve any problems. Had I done some thinking and/or market research ahead of time, it would have saved me tons of time to work on more worthwhile endeavors.

Good feedback is worth its weight in gold, even if it hurts a bit to hear it.


I don't think so; this is a corporate project, not an enthusiast's. They probably had to internally justify building it over using an existing solution, so they could simply share their rationale. I am not trying to rain on their parade.


It's a library written by some devs who thought it might be useful to others, too, and/or are proud enough to share their work. It's not that they'll publish it in an SEC filing.

For every n-th solution in the market, n-1 existing ones could have been used, but they weren't for (many times) good reasons.

And looking at their Makefile and pyproject.toml I can see that they knew what they wanted.


yes, Nike is a corp where everyone sings kumbaya and the engineers have unlimited time to work on and publish libraries they are proud of, they don’t have to deliver specific values at all.


the kumbaya and coding is only allowed in the 5 minute lunchbreak at the nike sweatshops


I think this is a fair criticism. You chose to share this on hackernews, which invites feedback (including constructive criticism). I don't see the problem ¯\_(ツ)_/¯ Do you expect people to only voice praise for open source projects?


Why do you think the poster is connected with the repo? I couldn’t see any link.


I don't. Regardless who posted it, discussion and constructive criticism is warranted in the comments.


/s

Like it or hate it, it’s how social media works. Everyone posts other people’s things, and then more people get together to rip it to shreds.


Those commenting that the parent lacks good faith, or that the code is free so take it as is, should also consider that the parent has every right to express their concerns and opinions. If you don't like criticism or inconvenient questions, just skip it or ignore it. For example, the parent's insights were interesting to me: good questions I can learn from about how to critically evaluate things.


> Workflow orchestration is a mature category, as evidenced by the length of this list (...)

Or could it be evidence that the existing tools all have their flaws, and any reasonably sized organization will hit those flaws pretty early, so many orgs come to the conclusion that the best approach is to just roll their own that fits their case relatively well?


no offense, but you clearly don’t know much about orchestrators


Are you at all familiar with the data integration / ETL framework space?

If so, I think you would recognize how unreasonable the presumption of your comment is. It certainly strikes me as such.

Most if not all the big OSS frameworks and commercial offerings originated in one big corp and subsequently moved into the Apache direction (Airflow) or spun out into their own companies.


> Workflow orchestration is a mature category

As far as I can tell from the docs, this is not an orchestrator at all. It is a Python-wrapper for certain runtimes like PySpark. I don't see anything in the docs that mentions DAGs, dependency definitions, scheduling, or deployment.


Every task orchestration tool kinda has a crappy security model. Sure, there are a ton of them, but when you start putting them all over the place it's just hectic to get right. That's a space where I feel someone could make gains in a big way.


yeah they need to explain why this is better than flyte


I'll agree that the pages about Koheesio hosted on Nike's website are strangely empty of persuasive specifics (reads like generative AI used by whoever was in charge of writing it up, lol), but maybe the authors believe that someone who's seriously considering a data pipeline library would want to look at the actual code (which, by the way, is structured quite well). Python doesn't "inspire confidence"? Think of the target audience. The framework claims to take typing and tests quite seriously, so perhaps your confidence is gained elsewhere. I didn't go through the codebase because I'm not interested, but your dismissive attitude doesn't radiate good faith.


>nobody wants to run multiple orchestrators

Straw man argument.

>I would expect someone who's seriously writing a new orchestrator in 2024 to cite the alternatives, their shortcomings, and how you intend to address them. Bonus points if you make a neat little table.

Did you even try to read the docs before you launched this critical diatribe?

From the docs (https://engineering.nike.com/koheesio/latest/tutorials/onboa...):

    Advantages of Koheesio

    Using Koheesio instead of raw Spark has several advantages:

    Modularity: Each step in the pipeline (reading, transformation, writing) is encapsulated in its own class, making the code easier to understand and maintain.
    Reusability: Steps can be reused across different tasks, reducing code duplication.
    Testability: Each step can be tested independently, making it easier to write unit tests.
    Flexibility: The behavior of a task can be customized using a Context class.
    Consistency: Koheesio enforces a consistent structure for data processing tasks, making it easier for new developers to understand the codebase.
    Error Handling: Koheesio provides a consistent way to handle errors and exceptions in data processing tasks.
    Logging: Koheesio provides a consistent way to log information and errors in data processing tasks.

    In contrast, using the plain PySpark API for transformations can lead to more verbose and less structured code, which can be harder to understand, maintain, and test. It also doesn't provide the same level of error handling, logging, and flexibility as the Koheesio Transform class.

It took me less than 15 seconds to find the solution to the problem you propose. How long did it take you to formulate your critique? Do you perhaps just have a prejudice against Nike (corporate haze), or is it an investment in a 'competing orchestrator' that is clouding your judgement?


Every modern workflow orchestrator does those things you quote, and more. You make it sound like they're innovations when they're table stakes. Why wouldn't you just use Flyte or Kubeflow?

Also, the fact that they say that the alternative is "raw Spark" tells me either that they're confused, or not very good at explaining. Spark is used to execute tasks in a pipeline, not to orchestrate it.


While I generally tend to agree with your basic criticism, I think you need to keep in mind our perspectives might be biased due to limited data.

Flyte went OSS what, 4 years ago? I'm not super familiar with it, but a) could have been that it was too unpolished at the time or b) requiring K8s to be a non-starter for some teams/ orgs. Same for Kubeflow.

We also don't know for how long Koheesio existed within Nike.

In short, there's a lot we don't know and there's a good chance the internal reasoning for investing in this made sense under certain circumstances in the past.


This project is two weeks old!


The repo on GitHub is two weeks old?

Doesn’t mean the project originated two weeks ago.


So this is like Broadway (Elixir), but for Python?


That's really cool. Have you already seen the dlt library? That one's made for very easy-to-use EL in Python. It's similarly modular and built by senior data engineers for the data team, and the sources are generators, which you could probably use too.

How is koheesio different to dlt? Where could they complement each other?


Had Nike as a client for a period of time, interacted with quite a few people across their data org. There is absolutely no software you want authored by them.


I worked at a large Healthtech company before. I get the sentiment. But, in the forest of software based on poor decisions, there were definitely a couple of valuable gems made by very knowledgeable people. Generally those were the people with an "opensource mindset" (as opposed to the "my ultra-crappy code that is just some FOSS glued together is super valuable IP and you need to read 100s of QMS docs before you can lay your eyes on it"-people). Don't measure the whole org by the same yardstick.


This is just flat out rude.


It's a big organization, but I can understand the feeling, because I had the same attitude towards Microsoft, Oracle, Salesforce and many others.


It’s a big organization, which means there will be areas full of great people and areas full of less-than great people. To discount all work from a group that large is just silly.


Another snakemake?



