We're building a data mesh at Splitgraph [0]. We provide a unified interface to query and discover data products. You can query the data using the Postgres wire protocol at a single endpoint, with any of your existing tools. And you can discover it in the catalog, using a familiar GitHub-like interface. You can try this right now on the public website, where we federate access to 40k open datasets. Every dataset is addressable with a `namespace/repository:tag` format. The `tag` can refer either to the live data, in which case we forward the query upstream, or to a versioned snapshot of the data (a "data image") that you build with declarative, Docker-like tooling. [1]
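To make the access pattern concrete, here's a minimal sketch of querying the endpoint over the Postgres wire protocol with psycopg2. The host, credentials, repository, and table names are all placeholders, not our real defaults -- any Postgres client would work the same way:

    import psycopg2  # any client that speaks the Postgres wire protocol works

    # Placeholder endpoint and credentials -- substitute your own.
    conn = psycopg2.connect(
        host="data.example.com", port=5432,
        user="my_api_key", password="my_api_secret", dbname="ddn",
    )

    with conn.cursor() as cur:
        # The quoted namespace/repository:tag address acts as the schema;
        # the repository and table here are hypothetical.
        cur.execute(
            'SELECT COUNT(*) FROM "some-namespace/some-repo:latest".some_table'
        )
        print(cur.fetchone())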
On the enterprise side, integrating the access and discovery layers gives a lot of advantages, especially around data governance. On the web, we give users tools to connect data sources, document them, and share/audit access to them. When a query comes through the endpoint, since we're implemented as a Postgres proxy, we can rewrite/filter/drop it in accordance with rules, or we can forward it along to the upstream data source(s) and/or join across them. If you use Splitfiles to generate versioned data, we can also provide data lineage/provenance and full reproducibility.
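As an illustration of the rule-based rewriting (a toy sketch, not our actual proxy -- the rule format, roles, and table names here are made up), the core idea is a check that runs before a query is forwarded upstream:

    import re

    # Hypothetical per-table rules: who may read a table, plus an optional
    # row-level predicate the proxy injects before forwarding upstream.
    RULES = {
        "finance.salaries": {"allowed_roles": {"hr"}, "row_filter": "region = 'EU'"},
    }

    def apply_rules(sql: str, role: str) -> str:
        """Drop, rewrite, or forward a query according to the rules."""
        for table, rule in RULES.items():
            if re.search(r"\b" + re.escape(table) + r"\b", sql):
                if role not in rule["allowed_roles"]:
                    raise PermissionError(f"role {role!r} may not read {table}")
                if rule.get("row_filter"):
                    # Naive string rewrite for illustration only; a real proxy
                    # would rewrite the parsed query tree, not raw SQL text.
                    clause = " AND " if " where " in sql.lower() else " WHERE "
                    sql += clause + rule["row_filter"]
        return sql  # forward the (possibly rewritten) query upstream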
We've been working on this for ~3 years but are still pretty early. If anyone wants to help, we just raised a seed round and are hiring a remote team -- check my comment history for links.

[0] https://www.splitgraph.com

[1] https://www.splitgraph.com/docs/working-with-data/using-spli...
I think that the main point of the article is that a company’s data strategy should result in discrete data products aligned with business domains. A domain-oriented team should be responsible for each data product. Data infrastructure should cover universal data-processing concerns, but should not include business logic. These characteristics contrast with a centralized data lake, where a single organization is responsible for both the infrastructure and content of the data resource.
I can't distinguish between what is described here and a service-oriented architecture (SOA) approach:
Discoverable
Addressable
Trustworthy and truthful
Self-describing semantics and syntax
Inter-operable and governed by global standards
Secure and governed by a global access control
A reminder that Thoughtworks was highly influential in pushing microservices. This may be an elaborate mea culpa ("oops, SOA was actually more sensible") without admitting 'culpa': a rehash of SOA with a set of features (above) that looks awfully like those highly elaborate SOA proposals, with XML and all sorts of metadata to 'couple' these "data products" (previously called services).
This is about data for internal analytics purposes, typically meant to be queried using SQL. If you expect a typical data scientist or business analyst to pull together data from a dozen microservices and join it themselves in order to make a report, the best case scenario is that it will take 10x as long as it should.
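To see why, compare the two workflows on a trivial report (the service URLs, fields, and tables below are hypothetical):

    import requests

    # Microservice route: call each service, then join by hand client-side.
    orders = requests.get("https://orders.internal/api/orders").json()
    users = {u["id"]: u for u in
             requests.get("https://users.internal/api/users").json()}
    report = [
        {"user": users[o["user_id"]]["name"], "total": o["total"]}
        for o in orders
        if o["user_id"] in users  # and that's before pagination, auth, retries...
    ]

    # Analytics route: the same report as one SQL statement:
    #   SELECT u.name AS user, o.total
    #   FROM orders o JOIN users u ON u.id = o.user_id;

And that's the two-service case; with a dozen sources the gap only widens.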
If any folks want to learn more about data mesh, we have a (vendor-independent) Slack for sharing ideas and insights. I teamed up with Zhamak, the author, to launch it. It's still early days, but we're at 1K+ members after a month, so hopefully it can really help people get the content they need to learn about it all.
[0] https://launchpass.com/data-mesh-learning
I also compiled a list of public user stories:
[1] https://www.reddit.com/r/datamesh/comments/m6ecuz/data_mesh_...
It's a real term, but it's not a useful term to engineers. There is no such thing as a "data lake system"; there are databases, filesystems, object stores, etc. Where the term 'data lake' is actually useful is in describing, to non-technical people, a logical system that holds data pulled from all over the company in one place. Inevitably the actual implementation will be a dozen or more different cobbled-together systems and technologies, but if you try to explain that to your finance team, their eyes will glaze over immediately; hence the need for the term 'data lake'.
data eutrophication is causing mass die off of insights, while limnic data eruption is well overdue in the majority of the world's largest endorheic data basins