Data Mesh Architecture (datamesh-architecture.com)
128 points by aiobe on March 19, 2022 | 45 comments



Wow, this is an oversimplification. I've had years of experience working on a data lake within a FAANG handling > 5 PB of ingest per day. There are so many things this misses:

1. What if the domain teams don't actually care to maintain data quality or even care about sharing data in the first place? This model requires every data producer to maintain a relationship with every data consumer. That's not gonna happen in a large company.

2. Who pays for query compute and data storage when you're dealing with petabytes and petabytes of data from different domains? If you (the data platform team) bill the domain teams then see above, they'll just stop sending data.

3. Just figuring out what data exists in the data mart (which this essentially is describing) is a hassle and slows down business use cases, especially when you have 1000s of datasets. You need a team to act as sort of a "reference librarian" to help those querying data. You can't easily decentralize this.

4. How do you get domain teams to produce data in a form that is easy to query? What if they write lots of small files that are computationally expensive to query? Who's gonna advise them? (See the compaction sketch after this list.) How data is produced is closely tied to query performance at TB scale, and domain teams are not gonna become experts or care.

5. What do you do when a domain team has a lot of important data but no engineering resources? Do you just say "oh well, we're just a self-service data platform so no one gets to access the data"?
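To make point 4 concrete, here is a rough sketch of the kind of compaction work a platform team ends up doing on behalf of producers. It assumes a pyarrow/Parquet stack; the bucket paths and row-count targets are made up:

    import pyarrow.dataset as ds

    # Read the producer's many small files as one logical dataset...
    small_files = ds.dataset("s3://lake/raw/domain-team/events/", format="parquet")

    # ...and rewrite them as fewer, larger files with large row groups,
    # which is what actually makes them cheap to scan.
    ds.write_dataset(
        small_files,
        "s3://lake/curated/domain-team/events/",
        format="parquet",
        max_rows_per_file=5_000_000,
        min_rows_per_group=500_000,
    )

None of this is hard, but someone has to know it matters and own running it.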


At some point one has to ask: what are you guys doing ingesting 5 PB of data per day?!

Unless this is Google, that doesn't make any sense. 5 PB spread across roughly 8 billion people works out to about 0.6 MB per human on the planet, every day.


Why the concept "data lake" emerged in the first place:

Thinking about what to store (and what to log) is not trivial and takes careful consideration. Plus, there's always the argument: "But what if we need something that we forgot to store or log?".

Answering that takes time and risk-acceptance that most developers in most projects don't get.

"Datalake" was just a pseudo-solution to gain peace-of-mind: We throw everything into a big bucket and figure it out later.


Worked in a similar environment. Events and logs. When a single page view does 200+ database queries, triggers another hundred requests, ML services, analytics and other tracking, and a transaction kicks off a chain of a hundred events in your stream, it's pretty easy to reach those insane numbers. That one page view can easily add 1MB of data.

Just like performance optimization, it's cheaper to buy more hardware than to pay for humans to think about it and coordinate.


I did say FAANG :)

Believe it or not, this is just for security data. But that fact combined with SOA leads to lots of logs. (5 PB is the uncompressed amount: we decompress incoming data, then ETL.)


Would love to know some of the answers to the questions you posed


I believe the data mesh claim is that, in addition to the operational API a given domain team already exposes and supports, the team takes on a new goal: exposing and supporting an analytical API to deliver a data product to potential consumers. For this to happen, an organisation would need to value this new objective -- perhaps comparably to operational objectives -- and fund and resource it adequately. Arguably, throwing enough budget and appropriately skilled headcount at it might address points 1, 2, 4 and 5.
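For concreteness, here is a minimal sketch of what such an analytical "data product" contract might look like. The field names are illustrative assumptions, not taken from the data mesh material:

    from dataclasses import dataclass, field

    @dataclass
    class DataProduct:
        """Hypothetical contract a domain team publishes alongside its operational API."""
        name: str                 # e.g. "checkout.orders.v1"
        owner: str                # the domain team accountable for it
        schema: dict              # column name -> type: the analytical interface
        freshness_sla_hours: int  # how stale consumers can expect the data to be
        quality_checks: list = field(default_factory=list)

    orders = DataProduct(
        name="checkout.orders.v1",
        owner="checkout-team",
        schema={"order_id": "string", "amount_cents": "int64", "created_at": "timestamp"},
        freshness_sla_hours=24,
        quality_checks=["order_id is unique", "amount_cents >= 0"],
    )

Publishing the contract is the easy part; funding the team to keep it true over time is the question here.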

But, it's not obvious it actually makes sense for an organisation to value analytical concerns at a similar level to operational concerns, with similar resourcing. The value to the organisation of building and maintaining analytical APIs to serve data products is likely to be considerably less -- perhaps by one or two orders of magnitude -- than the value produced by actually performing and maintaining the core operational function.

If the stars align, maybe in future the analytical data could be used as input to an optimisation project that improves some key metric of the operational function by 10% or 1%. How much is it worth paying to have the possibility of an outcome like that in future? It's not obvious: it really comes down to how valuable a 10% or 1% lift would be and how much it would cost, which needs some kind of business case. It's not obvious that resourcing analytical data APIs owned by arbitrary domain teams everywhere is a sound investment for the business.


Just on point 1: maybe in the context of said FAANG, data quality was optional, if addressed at all. In other industries, e.g. finance, it can be regulated and audited, so operational teams care because otherwise they can't continue operating. That partially addresses point 2 as well, though it is a complex topic.


In my experience data quality in finance is much worse than at FAANG. It's common to have just the raw data feed from markets/trades/the network dumped into an OLAP DB, and whoever is using it has to sort through it, whereas FAANGs have data engineers to clean stuff up.


I am in finance, and while I don't question that this may be the case for some, the reality is that with regulatory requirements getting more sophisticated, anyone who does not focus on DQ end to end is making a very costly decision. When the quality of data pipelines ends up impacting capital requirements, the cost of bad data quality hits the P&L straight away.


On point 1: "That's not gonna happen in a large company."

This doesn't happen in small or mid-sized companies, either. Or if it does happen, it happens begrudgingly. SWEs have too much to do.


It really feels like data mesh is a fairly half-baked concept born out of short-term consulting gigs and a desire to become a technical thought leader.


I got this same feeling when reading the original white paper linked on the page [1]. It's filled with the kind of bloated, abstract "consultant speak" chosen to mask relatively straightforward ideas. And then there is this casual claim at the end of the paper that IMO discredits everything preceding it [2]: "Luckily, building common infrastructure as a platform is a well understood and solved problem;"

[1]: https://martinfowler.com/articles/data-monolith-to-mesh.html

[2]: https://martinfowler.com/articles/data-monolith-to-mesh.html...


Reminds me a lot of the first OLAP cubes: something consultants online praise as much as possible, just so that 3-4 years later they get contracted by the company to fix the mess it created.


What are the downsides of OLAP cubes, and how were they fixed? Curious to level up my understanding.


I guess they had their place at some point in time, but I still vividly remember my old manager talking about building an OLAP cube in 2018. https://www.holistics.io/blog/the-rise-and-fall-of-the-olap-...


What made them obsolete, Snowflake?


Is there an underlying assumption here that all of the datasets' domains are perfectly in sync with each other in the context of domain metadata?

As an example, Team1 might define the manufacturer of a Sprocket as the company that assembled it, whereas Team2 might define the manufacturer as the company that built the Sprocket's engine. Since the purpose of a data mesh is to enable other teams to perform cross-domain data analytics, there needs to be reconciliation regarding these definitions, or it'll become a datamess. Where does that get resolved?


Data mesh is not a complete framework; it's more sociopolitical than technical at the moment. When it's tested in practice, I think the key technical component you already alluded to, i.e. reconciliation, will need to be more central. What that means in terms of domain ownership of reconciliation is an open question.


The chief data officer, in close collaboration with the chief data engineering officer, must elaborate automated normalization guidelines, backed by implementations used across all data streams, to ensure that any skew in the data model is limited to non-production environments and that all data entities are materialized consistently across the whole data model.


What type of company are you working for? Usually there isn't even a CIO; I haven't even heard of a company with both a CDO and a CDEO (or even a CDEO at all).

I thought a big portion of the need that data mesh fills comes from organizations that are missing resources in their core BI team.


There's no magic. You need a core team that pivots from writing code at O(n) cost enterprise-wide to more or less amortized O(1), where n is the number of data streams: writing code once per stream vs. once for a standardized stream format that gets reused. With data mesh alone I don't think it's going to work, but with standardized tools that let teams express transformations and code as data, every team effectively gets access to a self-service data warehouse limited to pre-approved happy paths that can be automatically monitored for the most part. That's where you gain efficiency and can let your BI teams focus on BI and not boilerplate code, infrastructure, conformity, etc.
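A minimal sketch of the "transformations as data" idea, with an invented spec format: teams describe their streams declaratively and one generic runner, owned by the core team, executes every spec, so the runner is written once rather than once per stream:

    def run_pipeline(rows, spec):
        """Generic runner: applies a declarative transform spec to a list of dict records."""
        for t in spec["transforms"]:
            if t["op"] == "rename":
                rows = [{(t["to"] if k == t["from"] else k): v for k, v in r.items()}
                        for r in rows]
            elif t["op"] == "filter":
                rows = [r for r in rows if r[t["column"]] >= t["min"]]
        return rows

    # A domain team contributes only this spec, not pipeline code.
    stream_spec = {
        "transforms": [
            {"op": "rename", "from": "ts", "to": "event_time"},
            {"op": "filter", "column": "amount_cents", "min": 1},
        ],
    }

    events = [
        {"payment_id": "a1", "amount_cents": 0, "ts": "2022-03-19"},
        {"payment_id": "a2", "amount_cents": 499, "ts": "2022-03-19"},
    ]
    print(run_pipeline(events, stream_spec))  # only the non-zero payment survives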


Yes, it's a similar path to the one I am taking (while leading BI in my org). Getting the first glimpses of self-service from the analysis perspective is super easy thanks to tools like Metabase.

Bringing data in is a completely different story, especially in non-tech organizations. The gap between how a power user from a specific department and somebody from my team bring in and transform data is still too big, and the conventions are hard to enforce (following naming conventions, keeping the same data formats for the same columns, lowercasing certain columns so joins are done correctly...). They usually have their "playground schemas", but that's very far from saying that they "own" data quality there.
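To make the enforcement gap concrete, here is a rough sketch (column names invented) of the kind of normalization step that is hard to get power users to apply consistently by hand:

    import re

    def normalize(rows, join_keys=("customer_email", "country_code")):
        """Snake_case the column names and lowercase the join keys before data lands in a shared schema."""
        out = []
        for r in rows:
            clean = {re.sub(r"\W+", "_", k.strip()).lower(): v for k, v in r.items()}
            for key in join_keys:
                if isinstance(clean.get(key), str):
                    clean[key] = clean[key].strip().lower()
            out.append(clean)
        return out

    rows = [{"Customer Email": "Anna@Example.COM", "Country Code": "DE"}]
    print(normalize(rows))  # [{'customer_email': 'anna@example.com', 'country_code': 'de'}]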


A data mesh approach probably wouldn't work in the sort of organization you describe.

IMO - to make it work, you need a consistent taxonomy, or a way of translating from a particular domain to some sort of interchange format.

If you have that then a set of centralized tools can pull from the separate domains using a core set of protocols to produce reports etc.
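A minimal sketch of that per-domain translation into a shared interchange format, reusing the Sprocket example from upthread; all field names here are invented:

    # Each domain owns a translator into the agreed interchange fields, so the
    # centralized tooling only ever sees one unambiguous vocabulary.
    INTERCHANGE_FIELDS = ("part_id", "engine_builder", "assembler")

    def from_team1(record):
        # Team1's "manufacturer" means the company that assembled the sprocket.
        return {"part_id": record["sprocket_id"],
                "engine_builder": record["engine_supplier"],
                "assembler": record["manufacturer"]}

    def from_team2(record):
        # Team2's "manufacturer" means the company that built the engine.
        return {"part_id": record["part_no"],
                "engine_builder": record["manufacturer"],
                "assembler": record["assembly_partner"]}

    print(from_team1({"sprocket_id": "s-1", "engine_supplier": "EngineCo", "manufacturer": "AssemblyCo"}))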


It looks like a weird attempt to build a consulting business around a simple idea.

Treat data assets like microservices and pipelines like the network. Period.

Prescribing everything else rubs me the wrong way.

So, data mesh is: an architecture in which a company's data is organized into loosely coupled data assets.


So if I understand this correctly, data mesh is just a data mart that doesn't bring data into a database as tables, but uses S3 storage instead (I assume because that's cheaper in the cloud)?


That, plus a central data platform team that provides infra, quality monitors, data lineage and catalogue capabilities, plus a central team that provides guidelines on SLAs, metadata standards, etc. Sounds good in theory; I am eager to see how it fails in practice.


I can chime in as part of the central team for SLAs, etc. We offer a platform that produces datasets given some inputs and SQL, and pushes them to downstream systems. Standardized jobs are run after the user's SQL to produce standardized outputs.
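As a rough sketch of that shape, the platform wraps user-supplied SQL with standardized jobs. This uses sqlite3 only so the example is self-contained, and the post-check shown is an invented stand-in, not our actual platform:

    import sqlite3

    def run_dataset_job(con, user_sql, output_table):
        # 1. The user-provided SQL defines the dataset.
        con.execute(f"CREATE TABLE {output_table} AS {user_sql}")
        # 2. A standardized post-processing job the central team runs for everyone,
        #    e.g. reject empty outputs before pushing downstream.
        n = con.execute(f"SELECT COUNT(*) FROM {output_table}").fetchone()[0]
        if n == 0:
            raise ValueError(f"{output_table}: standard check failed, empty output")
        return n

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (order_id TEXT, amount_cents INT)")
    con.execute("INSERT INTO orders VALUES ('a1', 499)")
    print(run_dataset_job(con, "SELECT * FROM orders WHERE amount_cents > 0", "orders_clean"))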

It works well, but has many issues too. Users' SQL and input data can differ, often in unpredictable ways, because they bring their own and expect the central team to handle the rest. Those edge cases break the standardization rules, fail the workflow, and confuse the user because the platform is a black box, so they ask about changing it or adding a new feature. Now your standardization asks are bottlenecked by this central team, and the options are:

- wait for the central team to fix/improve it

- find some hack around the platform

- don't use the platform and its associated tooling, so you build it yourself and have another disjoint system for a specific use case

- the central team might build a feature that one team asked for a year ago, but now nobody needs it anymore and nobody knows why it's in the code. Repeat that many times for various asks over the years and your code base becomes a foreign mess.

- give your resources/funding to the central team to prioritize your ask. A few years after it's built, the central team owns something they themselves never wanted.


Many points of internet karma (and perhaps a profitable career as a consultant) await anyone who spills the beans on how their grand data mesh rearchitecture actually turned out a few years down the track, and whether the exciting new problems caused by the data mesh were easier or harder to deal with than the boring old problems caused by the organisational and IT architecture it replaced.


You mean like the shit show that is Data Vault? https://danlinstedt.com/allposts/datavaultcat/datavault-issu...

>>data vault 2.0 brings with it methodology, architecture, modeling, and implementation – best practices, standards, automation and more. the ability to encompass and leverage disciplined agile delivery, and sei/cmmi, six sigma, lean initiatives, cycle time reduction, and proper build practices lead us to one day sprint cycles.

And let's not forget that shit show of a "book" https://www.amazon.com/Data-Architecture-Primer-Scientist-Wa...


The concept of a data mesh is more of a business concept than a technical one. IMHO the idea being proposed is that of a conceptual data server (not to be confused with a database server), much like an HTTP server or a mail server, where people can engage with data as a first-class citizen and create "data" products. This is especially true as we move from HTML to something like HDML (hyper data markup).

By making data the product (and abstracting all the gory details), you are fundamentally engaging with data through a UI or an API. As you expose these products they become accretive, while fundamentally encapsulating the domain expertise within them.


This seems like mostly common sense. Infrastructure teams should always be building tools that the org consumes (and ideally the general public too).

In a lot of orgs this goes sideways and the infrastructure teams end up owning everything and never have time to do anything else. Usually this happens due to upper management putting on the squeeze.

In order for teams to actually own their infrastructure and data we need better tooling to help them. This is coming along nowadays but isn’t fully there.


Dunno about the merits of this, but it does seem to be part of the overall effort to rethink how to organize large groups of people working together. With the internet we can afford peer-to-peer communication, and we don't have to organize into hierarchies. But we can't just do full-mesh communication either, because that's overwhelming to individuals, as anyone who lived through the initial slack-and-zoom remote work of early 2020 can tell you. (Though lots of people are still living through it, unfortunately)

So what kind of communication structures are good, and in what circumstances? How do we structure work so that we don't have to communicate about everything? When do we fall back to ad-hoc video chat or even in-person meetings? These are the kinds of questions that 21st-century management has to answer. It's fascinating to watch people grapple with them.


Lots of concerns and scepticism in the discussions here. Any suggestions about good, achievable data strategies and data architecture that work at enterprise level?


Require domain teams' code to communicate (with other domain teams and with the outside world) using the same pathways, schemas, and contracts that are used when extracting a domain team's data into a data lake.

Whether or not that data lake is semi-operated by the team (as proposed in the article) or operated centrally, requiring the lake's ETL process to use at least some of the APIs and tools used for transactional interaction goes a long way towards making data architecture tend towards sanity.

Resist the temptation of things like RDBMS-level CDC/log-stream capture or database snapshots for populating data lakes (RDS Aurora's snapshot export/restore is like methamphetamine in this area: incredibly fast and powerful, but with a very severe long-term cost for data lake uniformity and usability).

I'm not saying "every row in the data lake must be extracted by making the exact same API hit that an internet user would make, with all of the overhead incurred by that". You can tap into the stack at a lower level than that (e.g. use the same DAOs that user APIs use when populating the data lake, but skip the whole web layer). Just don't tap into the lowest possible layer of the stack for data lake ETL--even though that lowest layer is probably the quickest to get working and most performant, it results in poor data hygiene over the medium and long term.
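A minimal sketch of the "tap the DAO layer, not the database" idea, with invented names: the user-facing API and the lake export both go through the same DAO, so the lake sees the same shapes and business rules the operational API does:

    class FakeDB:
        """Stand-in for the operational store."""
        def fetch_orders_since(self, ts):
            return [{"id": "o-1", "status": "SHIPPED", "updated_at": "2022-03-19"}]

    class OrderDAO:
        """Used by the web layer *and* by the lake export, so both see one contract."""
        def __init__(self, db):
            self.db = db

        def orders_updated_since(self, ts):
            # Apply the domain's own serialization rules instead of exposing raw rows.
            return [{"order_id": r["id"], "status": r["status"].lower()}
                    for r in self.db.fetch_orders_since(ts)]

    def export_to_lake(dao, since, write_batch):
        # The ETL path reuses the DAO rather than a low-level snapshot/CDC tap.
        write_batch(dao.orders_updated_since(since))

    export_to_lake(OrderDAO(FakeDB()), "2022-03-18", print)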


It sounds almost entirely about team responsibility and governance, rather than technical architecture. What’s the difference from a data lake on a technical level?


You get a data lake per team/service.


what's a good word for a region with a bunch of small lakes? the fens?

"Data Mesh" is much trendier branding than "Data Fenlands".


Sounds like a Silicon Fen[1] startup waiting to happen.

[1] https://en.wikipedia.org/wiki/Silicon_Fen



Isn't this usually called a "data mart" as opposed to a "data mesh"? Or is the "mesh" term intended to point to something more unstructured, like a team- or business-division-level equivalent of a data lake? But isn't that just a data pond?


Mesh implies that there are clearly defined join keys on each dataset that allow you to join across domains.

E-commerce example: the warehouse team might produce a batch of datasets, and the web site team as well. A data lake approach would have a single data team owning both sets of datasets. A data mesh would have each team responsible for maintaining their own, and for making sure they're interoperable (like having a shared order ID concept).
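A toy sketch of that interoperability point (dataset contents are made up): each team publishes its own dataset, and a consumer can join across domains only because both teams agreed on the order_id key:

    website_orders = [{"order_id": "o-1", "channel": "mobile"},
                      {"order_id": "o-2", "channel": "web"}]
    warehouse_shipments = [{"order_id": "o-1", "shipped": True}]

    # Cross-domain join on the shared key.
    shipped_by_channel = [
        {**o, **s}
        for o in website_orders
        for s in warehouse_shipments
        if o["order_id"] == s["order_id"]
    ]
    print(shipped_by_channel)  # [{'order_id': 'o-1', 'channel': 'mobile', 'shipped': True}]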


If you need so many "slides" to persuade your clients of something, I think you lost already.


Considering how many big companies are going about implementing this right now, I don't agree. The C-level likes slides.


Indeed, the Future State Architecture documents from central architects that I have seen were all PowerPoint presentations with at least 100 slides.



