I was just hired as the first permanent data scientist in a big corporation. They’ve previously relied on consultants to build the infrastructure and the data science pipelines. We’re still around 10 people in the team.
The code is not pretty to look at, but this is not our biggest problem. We inherited a weird infrastructure: a mix of files in HDF5 and Parquet format dumped in S3, read with Hive and Spark.
Here are the current issues:
- The volume does not require a solution this complex (we’re talking 100 GB max, accumulated over the past 4 years).
- It’s a mess: every time we onboard a new person we have to spend several days explaining where the data is.
- There is no simple way to explore the data.
- Data and code end up being duplicated: people working on different projects that need the same subset each write their own transformation pipeline to produce the same results.
Am I the only person here who finds it completely insane?
I was thinking about building a pipeline to dump the raw data into Postgres, and then building other pipelines to denormalize and aggregate the data for each project. The difficulty with this, as with any data science project, is finding the sweet spot between data that is fine-grained enough to compute features from, yet fast enough to query to train models. My thinking is that in a first iteration, data scientists would explore their denormalized, aggregated data and create their own features in code. As a project matures, we could tweak the pipeline to compute those features. Do you have any experience with this?
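To make the idea concrete, here is a minimal sketch of the two stages I have in mind, assuming pandas + SQLAlchemy (plus pyarrow/s3fs for the S3 Parquet read); every path, table and column name below is made up for illustration:

```python
# Stage 1: load raw data as-is; Stage 2: build a denormalized aggregate per project.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://ds_user:***@db-host:5432/warehouse")

with engine.begin() as conn:
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS raw;"))
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS project_churn;"))

# Stage 1: dump the raw Parquet files from S3 into the "raw" schema, untouched.
events = pd.read_parquet("s3://company-bucket/events/2023/")  # hypothetical path
events.to_sql("events", engine, schema="raw", if_exists="append", index=False)

# Stage 2: one denormalized, aggregated table per project. Data scientists
# query this and compute features in their own code; once a feature
# stabilizes, it gets pushed down into this SQL.
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS project_churn.user_daily AS
        SELECT user_id,
               date_trunc('day', event_ts) AS day,
               count(*)    AS n_events,
               sum(amount) AS total_amount
        FROM raw.events
        GROUP BY user_id, date_trunc('day', event_ts);
    """))
```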
Finally, I love data science and I really don’t want to end up being the person who writes pipelines for everyone. Everyone else is a consultant, and they don’t have any incentive to care about the long-term impact of architecture choices: their management only evaluates delivery (graphs, model metrics, etc.). How do I go about raising awareness?
I don’t know what your data looks like: whether it’s just transactional, or a combination of transactional data and raw server/app logs. You could ETL the raw logs into an RDBMS like Postgres, but then you have to maintain it, and it doesn’t sound like you have the resources for that. You’d need help from IT/ops to set up a replica of the live server so it can be queried without disrupting transactional operations, and then either write the ETL code yourself or use a service like Stitch or Panoply.
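If you do end up hand-rolling the ETL rather than using Stitch or Panoply, it doesn’t have to be much code. A rough sketch, assuming newline-delimited JSON app logs; the log fields and table name are made up for illustration:

```python
import json
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=ds_user host=db-host")

with conn, conn.cursor() as cur, open("app.log") as f:
    cur.execute("CREATE SCHEMA IF NOT EXISTS raw;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw.app_logs (
            ts      timestamptz,
            level   text,
            user_id bigint,
            message text
        );
    """)
    # One insert per log line; fine for a sketch and for small volumes.
    for line in f:
        rec = json.loads(line)
        cur.execute(
            "INSERT INTO raw.app_logs (ts, level, user_id, message) "
            "VALUES (%s, %s, %s, %s)",
            (rec.get("ts"), rec.get("level"), rec.get("user_id"), rec.get("message")),
        )
```

For bulk loads you’d want COPY or psycopg2’s execute_values instead of row-by-row inserts, but the overall shape stays the same.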
You could also use a cloud warehouse like Google BigQuery or AWS Redshift to dump the raw data into, and then create views and table extracts for the commonly used business functions. That’s still overkill here, though; a simple RDBMS should suffice.
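The “views for common business functions” part carries over to plain Postgres just fine. A sketch with made-up table and column names, via SQLAlchemy:

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://ds_user:***@db-host:5432/warehouse")

with engine.begin() as conn:
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS analytics;"))
    # A plain view stays in sync with the raw tables, so there is no extra
    # refresh pipeline to maintain; switch to a MATERIALIZED VIEW if it gets slow.
    conn.execute(text("""
        CREATE OR REPLACE VIEW analytics.monthly_revenue AS
        SELECT date_trunc('month', order_ts) AS month,
               sum(amount)                   AS revenue
        FROM raw.orders
        GROUP BY 1;
    """))
```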
And if you want to raise awareness, see this article by Stitch Fix and the HN comments: https://news.ycombinator.com/item?id=11312243