What the Heck is a Data Mesh? (cnr.sh)
130 points by riccomini 4 months ago | 48 comments

From my experience, the core driver behind the data mesh architecture is organisational, not technological. Organisations are requiring more of data, be it for rapid product development, or self-service analytics. Often this involves large numbers of sources (e.g. external sources), rather than just larger volumes of the same thing.

If marketing, finance and sales are dependent on a centralised data team for every new thing, the data team quickly becomes the bottleneck, stifling innovation and frustrating teams. Incorporating the principles of a Data Mesh enables those teams to manage their own data, according to well-defined governance standards that enable interoperability.

The reality is that different teams are already managing their own data (via excel spreadsheets, web-apps, etc). If we can apply a bit more rigor to how these datasets are managed (e.g. so they can be shared, integrated, secured, etc), then the whole organisation benefits.

I think I’m experiencing this where I work. The Data Lake is quickly gaining traction and feature requests pour in: please incorporate FHIR genomics resources, please make a UI for this image type, please make import filters to extract metadata from these files… This team seems swamped now. The solution would be to give more power to the requesters? Allow them to access underlying technologies, implement their own data models? Seems logical. Am I understanding this correctly?

Yes, you are understanding it correctly. The idea is that you give the "requesters" access to the data, then enable them to do their thing with it (with training / support / shadowing) and publish their results as "data-products" so that others can leverage it too in their own "data products".

The "data mesh" is essentially the collection of these independent "data-products".

We already see management problems with self-service analytics like PowerBI, Tableau & Looker. It's too easy for people to create dashboards / reports that are subtly wrong and which cause confusion. There is a balance between empowering people to build data products and centralised control. Too much empowerment of people who don't understand the right way to do something leads to a horrible mess of contradictory data. Not enough, and people can't effectively do their job. Governance and process are the key to finding the balance and enforcing it.

The issue with the data-mesh is that there isn't really any great tooling to support the management or development of data products, or a data-mesh generally. I am sure this will change over time as vendors start building hype around it.

A bit self-serving, but I would recommend reading about Airbnb's Minerva (which I created). We leverage this data mesh concept to allow teams to define data independently, and then Minerva handles blending the data from different teams together with guaranteed consistency.

You can read more here: https://medium.com/airbnb-engineering/airbnb-metric-computat...

I help run the data mesh community and yup, 100%. There's a reason data mesh is catching on as fast as it is because, if done right, it really feels like it can solve a lot of the agility/scalability problems people feel re data/analytics now. It is NOT a silver bullet but it can potentially really help companies towards that (obnoxiously named and overused) goal of being data driven.

Agree. We see this a lot at clients whom we work with. While I agree with data mesh on a philosophical/ principles level, at the implementation level, it creates a division between data “haves” (those who have the engineering know-how to write parallel processing jobs) and data “have nots”.

End result - implementation of data mesh might deepen the divide between data "haves" and the data "have nots".

A better way could be to implement Trino, Starburst, or Tetmon EdgeSet (of which I am a co-founder) to realise the vision of data mesh.

In my experience at three large companies, any project where one part of the organization wants “the data” from another is actually just a power grab at the mid-manager level. When I hear “accounting wants direct access to the inventory data”, I interpret that as “the accounting manager thinks the inventory team is slow or incompetent, and thinks that if her own team just had the underlying inventory data directly accessible she could cut out the middle man!”

The problem of course is that data has to be interpreted, and often that interpretation is complicated. After all, that is why we write programs and don’t just query/insert into databases directly from the terminal. Most “data” is inextricably tied to the programs that interact with them, and freeing the data without making the complexities of the program known leaves both organizations open to horrible bugs.

I've been on both sides of this coin. As the requestor it's usually because I'm fucking tired of bureaucratic controls and shitty Oracle databases from 1982 running on Windows XP that won't let me execute arbitrary queries "because security".

I just want to do my job more efficiently instead of transcribing aircraft serviceability data by hand into excel.

That's kind of one of the reasons for data mesh. The domain is the one who controls how the data is stored and made available so they get to show off how useful their data is and might get some great insights back. But if the team is so lacking in empathy, that data mesh implementation will almost certainly fail. So there needs to be at least some buy-in but if you can convince a domain that participating is a public good (which many do) and that they actually have more control like this (they get data engineers added or at least embedded in their team), it can be gravy/groovy.

Seems like data mesh assumes a culture of good will and acting in good faith.

This makes a good case for visible data lineage (external system coupling), in conjunction with clear program/ETL documentation (internal data coupling), so you can see the full data transformation.
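To make "visible data lineage" a bit more concrete, here is a minimal sketch of a catalog where each data product records its owner, its upstream inputs, and where its transformation is documented. The `DataProduct` class, the `upstream_lineage` helper, and every product name are hypothetical, made up for illustration; real catalog tools will look different.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """A published dataset plus the metadata needed to trace it."""
    name: str
    owner: str                      # owning domain team
    inputs: tuple[str, ...] = ()    # names of upstream data products
    transform_doc: str = ""         # where the ETL logic is documented


def upstream_lineage(catalog: dict[str, DataProduct], name: str) -> list[str]:
    """Return every product `name` depends on, directly or transitively."""
    seen: list[str] = []
    stack = list(catalog[name].inputs)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(catalog[current].inputs)
    return sorted(seen)


# Hypothetical catalog: all names are invented for this example.
catalog = {
    p.name: p
    for p in [
        DataProduct("inventory.stock_levels", owner="inventory"),
        DataProduct("sales.orders", owner="sales"),
        DataProduct(
            "finance.cost_of_goods",
            owner="accounting",
            inputs=("inventory.stock_levels", "sales.orders"),
            transform_doc="docs/cost_of_goods.md",
        ),
        DataProduct(
            "finance.margin_report",
            owner="accounting",
            inputs=("finance.cost_of_goods",),
        ),
    ]
}

print(upstream_lineage(catalog, "finance.margin_report"))
```

With something like this, the accounting team can see that its margin report ultimately depends on inventory and sales data, and whom to ask about each step.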

There are a few cross-cutting concerns with a data mesh, namely authz, schema and caching. Most companies don't consider the data mesh at a company level, which is a shame, as solving all of the above should be doable at a company level.

I work at Hasura (disclaimer, not to self-promote) and of the user questions I've seen being fielded recently, this has been maybe one of the fastest-growing.

It's typically something like "My org $BIGCO has data in multiple places/databases, and teams have fragmented services they've set up for access with no consistent API or central hub for all of this."

And they are interested in a sort of data-aggregator/central-access point for the data stored in databases of varying dialects + merging their APIs into a unified service. Sometimes they also want to (transparently) join/map data across sources too.

I think this space is likely going to become more prevalent just by the nature of both organizational growth and inevitable tech debt.

It's an interesting domain and problem, that's for sure.

Not central to the main ideas of this article, but if you want to have a data mesh that is self-service, why force folks to use a particular storage medium like a data warehouse? That still requires centralization of the data.

Why not instead have a tool like Trino (https://trino.io) that allows you to let different domains use whatever datastore they happen to use. You still would need to enforce schema, but this can be done in tools like schema registry as mentioned in the article along with a data cataloging tool.
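As a rough illustration of what "enforcing schema" can mean when every domain keeps its own datastore, here is a minimal sketch of a shared contract check. This is plain Python standing in for a real schema registry, not the API of any actual tool; `ORDER_CONTRACT` and `contract_violations` are hypothetical names.

```python
from __future__ import annotations

# Hypothetical shared contract: field name -> required Python type.
ORDER_CONTRACT = {"order_id": str, "amount_cents": int, "currency": str}


def contract_violations(record: dict, contract: dict[str, type]) -> list[str]:
    """List the ways `record` breaks `contract` (empty list means it conforms)."""
    problems = []
    for field_name, expected in contract.items():
        if field_name not in record:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected):
            actual = type(record[field_name]).__name__
            problems.append(f"{field_name}: expected {expected.__name__}, got {actual}")
    return problems


good = {"order_id": "A-1", "amount_cents": 1250, "currency": "EUR"}
bad = {"order_id": "A-1", "amount_cents": "12.50"}

print(contract_violations(good, ORDER_CONTRACT))
print(contract_violations(bad, ORDER_CONTRACT))
```

The point is only that the contract lives in one agreed place while the storage behind it stays up to each domain.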

These tools facilitate the distributed nature of the problem nicely and encourage healthy standards to be discussed and then formalized in schema definitions and catalogs that remove the ambiguity of discourse and documentation.

Nice example is laid out in this repo of how Trino can accomplish data mesh principles 1 and 3 (https://github.com/findinpath/trino_data_mesh).

Few data mesh proponents "force" a particular storage medium, and the concept is largely agnostic in this regard. But lots of early implementations in the wild have decided to standardize on one - either on some cloud object storage or, indeed, a cloud DW.

One cannot argue how much it simplifies things in terms of manageability, access, cataloguing, performance… in an already complex architecture. Especially since no reference implementations exist.

I understand that if your persistence layer is heterogeneous from the get go, layering on top of it might be a solution. But it is also an additional layer that needs to be managed.

Conversely, in your opinion, what would be the shortcomings of centralizing on a modern, cloud-native data warehouse (tech, not the practice)? I see this being articulated less often.

You say, “one cannot argue performance of a data warehouse” but that’s precisely the issue with a DW. DW requires a lot of work to move data from the way that domains model their data on the service layer to how data is modeled in a central DW. You have to wait for data to become live to even begin running analysis on it. Setting up and worst of all maintaining pipelines is an expensive undertaking in both time and money.

It’s not to say the DW is bad and never the solution. The problem is making it the only solution and not providing domains the flexibility to model data the way they need it. You say it’s more complex to manage, but that’s the idea behind data mesh: you don’t manage that part, the team with their domain knowledge and data solution does. They can make it as simple or complex as they want internally, but if they follow the standards to play in your data mesh, who cares? Not your problem. For example, say a domain needs real-time data analytics and uses something like Druid to store their data. That’s fine. If they want to play in the data mesh you’ve provided, they just need to follow the rules in their data model, but they don’t need to use a cloud DW to do that.

You can’t argue that copying terabytes of data a day from a domain to a DW is more performant than running ad hoc analysis (MB to GB) on that data where it already lives. Why move or copy a dataset when you don’t need to? Why force domains to use any solution that’s not actually solving their domain problem?

I have to disagree with almost all of the points you raise.

The degree to which you wish to (re)model data is a design decision - also within a DW. This is what I mean by divorcing the tech from the practice. Methods to store and manipulate semi-structured data exist within cloud data warehouses, and there is nothing inherent in the technology preventing that support from being extended. Also, it seems to me that even in advanced analytics incl. ML one eventually works with the data in a tabular form anyway.

When it comes to performance, I was referring primarily to e.g. cross-domain queries. That continues to be challenging in data virtualization / federated query engines.

You (and data mesh) focus on the domains - and nothing prevents carving out ample portions of storage and compute from a cloud DW to them and doing the exact things you propose. Except that "federated computational governance" and enforcing standards is now heaps easier since everyone is relying on the same substrate.

Data mesh targets analytical workloads - and surely you do not suggest e.g. hooking Trino up directly to operational, OLTP databases? One has to, at least as things stand today, copy the data somewhere anyway, not least for historical analysis, and often also transform it to be understandable to a downstream (data product) consumer. So you will have "pipelines" no matter what - even in a data mesh.

Not only are the domain teams free to build the aforementioned pipelines and model data in any way they see fit - within a cloud-native DW, but also the DataOps tooling available in this area is already relatively mature.

The only part I can somewhat relate to is about "forcing" teams to rely on a specific piece of technology. But I think that is just something one needs to accept in a corporate setting, and a balancing act. And I'm talking about the majority of the regular companies out there, not FAANG. Also, there are other arguments to be made - such as avoiding vendor lock-in, or the capabilities of the DW simply not catering to the specific problem your domain has. But these are not the arguments you made.

Agree to disagree then I guess.

"When it comes to performance, I was referring primarily to e.g. cross-domain queries. That continues to be challenging in data virtualization / federated query engines."

Have you tried Trino lately? Data virtualization like Denodo still relies on moving data back between engines to execute a query; Trino pushes the queries down to both systems and processes the rest in flight. The fact that you use both of those interchangeably makes me think you may not have tried it.

"Data mesh targets analytical workloads - and surely you do not suggest e.g. hooking Trino up directly to operational, OLTP databases?" Not operational data. Are you saying teams aren't allowed to store immutable data in PostgreSQL?

"Not only are the domain teams free to build the aforementioned pipelines and model data in any way they see fit"

You will organically see teams build data infrastructure with different tooling. This happens when you don't tell a team they have to use a DW to play the analytics game. I have never seen a company that has multiple teams that organically land on using the same tech. So naturally (that is, without forcing them to use one substrate, i.e. a DW) you will have different teams (or domains) using different databases. This is why we have DWs to begin with. They were literally created to be a copy of domain data that naturally lived in many operational databases.

I get that in an ideal world, there would be some magical one size fits all solution that everyone would just use. However, that system doesn't exist. It's certainly not the DW. DW can be one of those solutions, just not THE one.

"Trino pushes the queries down to both systems and processes the rest in flight."

This is how many data virtualization techniques work as well (e.g. PolyBase from Microsoft) - predicate pushdown is not exactly new. That is why I mentioned both approaches. However, according to my experience, the degree to which this helps is highly workload-dependent. Do you happen to have some reference which would help me understand how e.g. cross-domain joins between large datasets can be optimized with this approach?
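For readers unfamiliar with the term, here is a toy sketch of predicate pushdown, using two in-memory SQLite databases as stand-ins for two domains' stores. This is not how Trino or PolyBase is implemented; the tables, data, and the `federated_join` helper are hypothetical, and the "engine" is just a Python function.

```python
import sqlite3


def make_source(rows, ddl, insert_sql):
    """Create an in-memory SQLite database standing in for one domain's store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Two independent "domain" databases (hypothetical data).
orders_db = make_source(
    [(1, "c1", 120), (2, "c2", 15), (3, "c1", 300), (4, "c3", 80)],
    "CREATE TABLE orders (id INTEGER, customer_id TEXT, amount INTEGER)",
    "INSERT INTO orders VALUES (?, ?, ?)",
)
customers_db = make_source(
    [("c1", "DE"), ("c2", "FR"), ("c3", "DE")],
    "CREATE TABLE customers (id TEXT, country TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
)


def federated_join(min_amount, country):
    """Push each filter down to its source, then join the reduced rows in memory."""
    # Pushed down: only matching rows leave each source.
    orders = orders_db.execute(
        "SELECT id, customer_id, amount FROM orders WHERE amount >= ?", (min_amount,)
    ).fetchall()
    wanted = {
        cid
        for (cid,) in customers_db.execute(
            "SELECT id FROM customers WHERE country = ?", (country,)
        )
    }
    # "In flight": the join itself happens in the coordinating process.
    return sorted((oid, cid, amt) for oid, cid, amt in orders if cid in wanted)


print(federated_join(100, "DE"))
```

The sketch also shows why the benefit is workload-dependent: with selective filters very little data moves, but a join between two large, unfiltered tables would still have to pull most rows out of both sources.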

"Are you saying teams aren't allowed to store immutable data in PostgreSQL?"

Of course they can. But unless you assume everyone (incl. the numerous COTS and legacy applications in any "normal" enterprise) is practicing event sourcing and/or their internal data models are somehow inherently understandable and usable by downstream consumers, a pipeline of some sort is required. That was my point. And if so, what's the difference for the team to target a dedicated DB within a cloud DW instance that speaks the PostgreSQL dialect - or close vs. a separate PostgreSQL instance?

Furthermore, if you think standardization of tools is not possible within an enterprise and everyone just does their own thing anyway (despite, mind you, suggesting Trino as one such tool yourself), then I have low hopes of the data mesh standards for governance being adopted either.

"Do you happen to have some reference which would help me understand how e.g. cross-domain joins between large datasets can be optimized with this approach?"

Check out the original Presto paper (to be clear, Trino was formerly PrestoSQL): https://trino.io/Presto_SQL_on_Everything.pdf. Also check out the definitive guide https://trino.io/blog/2021/04/21/the-definitive-guide.html for an even deeper dive.

"(incl. the numerous COTS and legacy applications in any "normal" enterprise)"

Funny you bring this up. In my last company we had a couple of legacy apps that we didn't even have the code for any more. Just the artifacts (the licenses for these services were expiring and we were just waiting to kill them off). For the year or so that we had, we were able to write a simple Trino connector to pull these values through the API, and represent them as a table to Trino. And that's the answer to your last question. It's the fact that you can meet the legacy code, or RDBMS database, or NoSQL database, or whatever the heck all these different teams own, and get access to it in one location without migrating data and maintaining pipelines.

"I have low hopes of getting the data mesh standards for governance ending up being adopted either."

My view is that data mesh should be opt-in and flexible. Each team can pick and choose what portion of their data is modeled and exposed. If a team models their data in a way that doesn't align with some central standard, then they just need to create a view or some mapping that exposes their internal setup to match the central standard. Per the data mesh principle of federated computational governance, each team has a say in what these standards are. There can be some strict standards, and some that are more open. Teams/domains only need to opt in to, or concern themselves with, the standards that are being requested of their data by the consumers. They can opt out of all the rest.
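A minimal sketch of the "create a view" idea, using SQLite only because it ships with Python. The internal table `bkg`, the view `bookings_v1`, and the "central standard" column names are all assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The domain's internal table, modeled however the team likes (hypothetical names).
conn.execute("CREATE TABLE bkg (bkg_ref TEXT, amt_minor INTEGER, ccy TEXT, created TEXT)")
conn.executemany(
    "INSERT INTO bkg VALUES (?, ?, ?, ?)",
    [("B-1", 12999, "EUR", "2021-07-01"), ("B-2", 4500, "USD", "2021-07-02")],
)

# The view is the published interface: it renames and converts columns to match
# an assumed central standard (booking_id, amount, currency, booking_date).
conn.execute(
    """
    CREATE VIEW bookings_v1 AS
    SELECT bkg_ref            AS booking_id,
           amt_minor / 100.0  AS amount,
           ccy                AS currency,
           created            AS booking_date
    FROM bkg
    """
)

rows = conn.execute(
    "SELECT booking_id, amount, currency FROM bookings_v1 ORDER BY booking_id"
).fetchall()
print(rows)
```

Consumers only ever query `bookings_v1`, so the team stays free to change `bkg` as long as the view keeps honoring the agreed shape.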

I personally think avoiding company-wide standards is likely the best way to approach adoption. You basically have a list of well documented standards that can easily be searched and understood by consumers (analysts, business folks, data scientists, etc..) and it's up to the domains/consumers to negotiate standards on more of an adhoc basis. Therefore participating in the data mesh isn't some giant meeting everyone needs to join to grow consensus around. It can be much more distributed and less invasive.

I feel like I don't have the prerequisite knowledge to understand the article. Does anyone have any tips on where I can gain the foundational knowledge necessary?

Zhamak's article is the canonical reference. It does a decent job of outlining the problem space:



Many thanks!

Pretty gentle learning path to understanding data mesh: https://datameshlearning.com/intro-to-data-mesh/

I appreciate it!

The concept of the data mesh makes sense, but I'm not sure what it means in practice? You have one big redshift and a catalog that says "this team owns this dataset", and that team does their own ETL? Likewise teams own kafka topics, etc.?

Estuary Flow [1] may be interesting to those in this space.

We're still building, but it's a GitOps workflow tool that tightly integrates schema definition (JSON Schema), captures and materializations from/to your systems & SaaS, rich transformations, catalog and provenance metadata tracking, built-in testing, and a managed runtime. All with sub-second latency.

Flow's runtime uses nascent but really promising open protocols for building connectors to the myriad systems and APIs out there. We're seeing Airbyte's work (itself built off of Singer) as the best steps in this direction and are leaning into that effort ourselves.

[1] github.com/estuary/flow

Feel free to throw it in the data mesh community Slack [1]. There was an interesting approach that sounds kinda similar re schema contract management from FindHotel that they posted a few weeks ago re data mesh [2].

[1] https://launchpass.com/data-mesh-learning
[2] https://blog.findhotel.net/2021/07/the-evolution-of-findhote...

I have a dumb question. Could I use flow to import a text file into a postgresql database? The text file is not append-only.

There are a lot of tools to import logs into stuff like Kafka, but not to import whole files (that can change) into a database.

Yep. You can, for example, have it watch file(s) in S3, and every time a file changes it will flow its records through into a table it creates in your DB, keyed on your (arbitrary) primary key.
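To show what "keyed on your primary key" buys you for a file that is not append-only, here is a minimal sketch in plain Python with SQLite. It is not Flow's API or configuration; the `items` table, the `sync_file` helper, and the CSV layout are hypothetical, and the file contents are passed in as strings to keep the example self-contained.

```python
import csv
import io
import sqlite3


def sync_file(conn, text):
    """Load the file's current contents into `items`, keyed on `id`."""
    rows = [(r["id"], r["name"], int(r["qty"])) for r in csv.DictReader(io.StringIO(text))]
    # INSERT OR REPLACE overwrites the existing row when the primary key matches.
    conn.executemany("INSERT OR REPLACE INTO items (id, name, qty) VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT, qty INTEGER)")

# First version of the file, then an edited version: one row changed, one added.
sync_file(conn, "id,name,qty\na1,bolt,10\na2,nut,5\n")
sync_file(conn, "id,name,qty\na1,bolt,7\na2,nut,5\na3,washer,20\n")

print(conn.execute("SELECT id, name, qty FROM items ORDER BY id").fetchall())
```

Re-running the sync after every change leaves one row per key with the latest values. Note that this sketch does not remove rows that were deleted from the file; that would need extra handling.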

Any way to watch local files too? S3 might be fine, but just asking.


My understanding of a Data Mesh is it's an approach to turn data into a product, much like you would create an API to interface with a service. A Data Mesh is additional business logic to make data easier to understand, at the cost of implementing that business logic.

A Data Mesh sounds eerily similar to Kimball. It's an up front investment to simplify the data. Kimball is frowned upon these days because dimensional modeling is another hurdle to your data. It makes sense that a Data Mesh would get the same treatment.

The fact that "Data Mesh" is pushed by a consulting company has me suspicious as well.

> A Data Mesh is additional business logic to make data easier to understand, at the cost of implementing that business logic.

It's logic that must be written nonetheless (in fact, logic that is currently written at any company with massive data and disparate sources of it), but instead of a centralized team of data engineers becoming the bottleneck -- and possibly misunderstanding the data -- the writing of said logic becomes the responsibility of the team who owns that particular domain, removing both the bottleneck and the hurdles of working with data you don't fully understand.

> The fact that "Data Mesh" is pushed by a consulting company has me suspicious as well.

That is a fair concern. Much of the software industry is busy selling snake oil and fads. It's our job to find the actual content and practices that work, and ditch the snake oil.

> the writing of said logic becomes the responsibility of the team who owns that particular domain

Thanks, that much makes sense. I'm still not buying into the idea of a Data Mesh, mostly because pushing costly requirements upstream is a hard idea to sell.

I worry about that too. I guess the idea is to tell them: you own this, you can deliver faster and better than some central data team.

But it is indeed a hard sell. I wonder if there are real success stories with no caveats. I'd love to learn about them.

100% with you on the last sentence. I have listened to a few podcasts about Data Mesh by Thoughtworks people and the similarities to pushing Microservices are striking. There might be some benefit to this but the operational and mental overhead first and foremost ensures billable hours for consultancies.

Having just read this and Zhamak's article, it seems that there may be some incentive alignment issue with this.

I assume a lot of valuable data originate from customer-facing applications, so the team that already has a customer-facing product now has to manage a new internal-facing data product.

My worry is that the data product won't get the love it deserves.

This is "solved" (at least to some degree) by adding additional resources to the domain teams - data engineers get embedded and/or added to domain teams to become the data product developers in most implementations. Hard agree that you cannot give a team significantly more responsibilities without more resources to help handle them.

Is this done in an incremental way? Changing the org from the central-data-team-as-bottleneck to this is a huge step. Just thinking of all the buy-in you need makes me dizzy. Everyone seems to want direct access to the data, but are they willing to make the effort of taking responsibility for it as well?

I'd love to see how a smooth transition to this goes :)

I’d worry that the extra headcount will just get sucked into operational priorities.

The payroll engineers are always going to prioritise fixing payroll problems over supplying data scientists. More engineers could easily just end up being used fixing the backlog.

> I use data warehouse, data mart, and data lake interchangeably here. Zhamak uses the term data plane.

Nit: "data plane" usually refers to something else (the data plane / control plane distinction). I'd drop that part of the note, since it adds another layer of confusion.

> While development teams spend time documenting, versioning, refactoring, and curating web service data models and APIs, data goes largely ignored.

I stopped reading at this point. A data model is mother-fucking data. Web servers just, well, serve it. </concept>

You should have kept reading, because the essay has some good points.

I think there is overlap: a data model is a kind of data, but not all the data. Other kinds of data often get neglected.

It solves many problems, but I think the GDPR side of things and protecting PII will be challenging since everyone will get a piece of raw data.

Another challenge that I see is maintaining security or migrations. Unless the central team has strong influence on technology selection for teams.

Why the title change?
