And what if you want to join data from different SaaS and internal systems, e.g. Google Analytics and a Pega decisioning system?
Are you going to spend months upfront carefully modelling the data in order to ingest it, making sure to handle schema and data-quality (DQ) issues, etc., all to support one use case that only needs a handful of fields?
No. Which is why data lakes exist: it's cost effective. You simply dump the data and ask the engineer or data scientist building the use case to do the heavy lifting, rather than a centralised data team.
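To make that concrete, here's a minimal sketch of what the "dump now, model later" approach can look like on the consumer side: the raw export just sits in the lake as files, and the data scientist does schema-on-read to pull out the handful of fields they need. The path, the field names, and the use of DuckDB are all assumptions for illustration, not a prescription.

    # Sketch: schema-on-read over raw JSON dumped into the lake.
    # Path and field names below are hypothetical; real GA exports differ.
    import duckdb

    events = duckdb.sql("""
        SELECT
            client_id,          -- hypothetical field names
            event_name,
            event_timestamp
        FROM read_json_auto('lake/raw/google_analytics/*.json')
        WHERE event_name = 'purchase'
    """).df()

    print(events.head())

No upfront modelling, no central team in the loop: whoever owns the use case decides which fields matter and deals with the schema quirks when they hit them.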
There are integration companies that solve this specific use case. I’ve used Fivetran [0] and highly recommend it. They extract and load data from your SaaS tools into your warehouse, and your data scientists can run SQL against the resulting tables. Their most popular warehouse targets are Redshift and Snowflake, so you can still use a centralized data warehouse without dedicating internal resources to the integrations.
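For illustration, a hedged sketch of what that workflow looks like from the data scientist's seat once the SaaS data has been landed in, say, Snowflake: a plain SQL join across the synced tables, run from Python. The connection details and every schema, table, and column name here are hypothetical; the actual names depend on your connectors and naming conventions.

    # Sketch: querying warehouse tables that an EL tool has already synced.
    # All identifiers and credentials below are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="analyst",
        password="...",
        warehouse="ANALYTICS_WH",
        database="RAW",
    )

    query = """
        SELECT ga.client_id,
               ga.event_name,
               pega.decision_outcome
        FROM google_analytics.events AS ga        -- synced SaaS data (hypothetical names)
        JOIN pega.decision_results   AS pega      -- internal Pega export (hypothetical names)
          ON ga.client_id = pega.customer_id
        WHERE ga.event_name = 'purchase'
    """

    for row in conn.cursor().execute(query):
        print(row)

The point is that the join itself is trivial once both sources live in the same warehouse; the hard part the vendor takes off your plate is keeping the pipelines that land those tables healthy.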
What I find amazing is that Fivetran is essentially a bunch of glue code forwarding data between different APIs and database formats, and yet it's legitimately useful, in part because when an upstream API breaks they fix the connector for you instead of you having to deal with the resulting emergency. But it's only needed because data interchange standards are in such poor shape: if users demanded that SaaS products expose data, event streams, and replication logs via robust, standardized APIs, a lot of the use cases for Fivetran would disappear.