I was just hired as the first permanent data scientist in a big corporation. They’ve previously relied on consultants to build the infrastructure and the data science pipelines. We’re still only around 10 people on the team.
The code is not pretty to look at, but this is not our biggest problem. We inherited a weird infrastructure: a mix of files in HDF5 and Parquet format dumped in S3, read with Hive and Spark.
Here are the current issues:
- The volume does not justify this much complexity (we’re talking 100 GB max, accumulated over the past 4 years).
- It’s a mess: every time we onboard a new person we have to spend several days explaining where the data is.
- There is no simple way to explore the data.
- Data and code end up duplicated: people working on different projects that need the same subset each write their own transformation pipeline to produce the same results.
Am I the only person here who finds it completely insane?
I was thinking about building a pipeline to dump the raw data into Postgres, then building other pipelines on top of it to denormalize and aggregate the data for each project. The difficulty with this, as with any data science project, is finding the sweet spot between data that is fine-grained enough to compute features from, yet fast enough to query when training models. My idea for a first iteration: data scientists explore their denormalized, aggregated data and create their own features in code. As a project matures, we tweak the pipeline to compute those features upstream. Do you have any experience with this?
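To make the two-layer idea concrete, here's a minimal sketch: a raw layer loaded as-is, and a per-project denormalized/aggregated table built by a pipeline instead of ad-hoc per-person code. Table and column names are made up for illustration, and I'm using an in-memory SQLite connection so the snippet runs standalone; in practice you'd point pandas/SQLAlchemy at Postgres instead.

```python
# Sketch of the two-layer pipeline: raw tables land in the database,
# per-project pipelines denormalize/aggregate on top of them.
# NOTE: table/column names are hypothetical; SQLite in-memory stands in
# for Postgres here so the example is self-contained (swap the
# connection for a SQLAlchemy engine against Postgres in real use).
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Raw layer: fine-grained events, loaded as-is from the S3 dumps.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "event":   ["view", "buy", "view", "view", "buy"],
    "amount":  [0.0, 9.99, 0.0, 0.0, 4.50],
})
events.to_sql("raw_events", conn, index=False)

# Project layer: one denormalized, aggregated table per project.
# A scheduled pipeline would rebuild this, so everyone queries the
# same table instead of re-deriving it in their own notebook.
conn.execute("""
    CREATE TABLE proj_user_features AS
    SELECT user_id,
           COUNT(*)                                AS n_events,
           SUM(CASE WHEN event = 'buy' THEN 1 END) AS n_buys,
           SUM(amount)                             AS total_spent
    FROM raw_events
    GROUP BY user_id
""")

features = pd.read_sql(
    "SELECT * FROM proj_user_features ORDER BY user_id", conn
)
print(features)
```

Early on, the project-layer query lives in the data scientist's own code; once it stabilizes, it moves into the pipeline (or a materialized view in Postgres) so the features are computed once and shared.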
Finally, I love data science and I really don’t want to end up being the person who writes pipelines for everyone. Everyone else is a consultant, and they don’t have any incentive to care about the long-term impact of architecture choices: their management only evaluates delivery (graphs, model metrics, etc.). How do I go about raising awareness?