Hacker News
Ask HN: How do I improve our data infrastructure?
163 points by remilouf on April 20, 2019 | 104 comments
I was just hired as the first permanent data scientist in a big corporation. They’ve previously relied on consultants to build the infrastructure and the data science pipelines. We’re still around 10 people in the team.

The code is not pretty to look at, but this is not our biggest problem. We inherited a weird infrastructure: a mix of files in HDF5 and Parquet format dumped in S3, read with Hive and Spark.

Here are the current issues:

- The volume does not require a solution that is this complex (we’re talking 100GB max accumulated over the past 4 years)

- It’s a mess: every time we onboard a new person we have to spend several days explaining where the data is.

- There is no simple way to explore the data.

- Data and code end up being duplicated: people working on several projects that require the same subset write their own transformation pipeline to get the same results.

Am I the only person here who finds it completely insane?

I was thinking about building a pipeline to dump the raw data into Postgres and then building other pipelines to denormalize and aggregate the data for each project. The difficulty with this, as with any data science project, is finding the sweet spot between data that is fine-grained enough to compute features from and data that is fast enough to query for model training. I was thinking that in a first iteration, data scientists would explore their denormalized, aggregated data and create their own features in code. As the project matures, we could tweak the pipeline to compute the features. Do you have any experience with this?
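A minimal sketch of that raw-then-derived layout, to make the idea concrete. It uses Python's built-in sqlite3 as a stand-in for Postgres so that it runs anywhere, and the table and column names (raw_transactions, client_features) are made up for illustration, not taken from the actual data.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Load raw rows once, then derive an aggregated table from them."""
    conn = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
    conn.execute(
        "CREATE TABLE raw_transactions (client_id INTEGER, day TEXT, amount REAL)"
    )
    # Hypothetical raw rows; in practice these would come from the S3 dumps.
    rows = [(1, "2019-04-01", 10.0), (1, "2019-04-02", 5.0), (2, "2019-04-01", 7.5)]
    conn.executemany("INSERT INTO raw_transactions VALUES (?, ?, ?)", rows)
    # Derived table: one row per client, cheap to query when training models.
    conn.execute(
        """
        CREATE TABLE client_features AS
        SELECT client_id,
               COUNT(*)    AS n_transactions,
               SUM(amount) AS total_amount
        FROM raw_transactions
        GROUP BY client_id
        """
    )
    conn.commit()
    return conn


conn = build_demo_db()
features = conn.execute(
    "SELECT client_id, n_transactions, total_amount "
    "FROM client_features ORDER BY client_id"
).fetchall()
print(features)
```

The raw table stays fine-grained, so a new feature can still be computed from it later, while each project reads from its own small derived table.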

Finally, I love data science and I really don’t want to end up being the person who writes pipelines for everyone. Everyone else is a consultant, and they don’t have any incentive to care about the long-term impact of architecture choices: their management only evaluates delivery (graphs, model metrics, etc.). How do I go about raising awareness?

They built a pipeline that complicated for 100gb? That’s insanely over-engineered! Very typical of engineers who just want to pad their resume at the expense of unsuspecting business people. I’ve worked with single server data warehouses on SQL Server that were 10x in size and served the entire company.

I don’t know what your data looks like, whether it’s just transactional or a combination of transactional and raw server/app logs. You could ETL the raw logs into an RDBMS like Postgres, but then you have to worry about maintaining it, and it doesn’t sound like you have enough resources for that. To do that you need help from IT/ops to set up a replica of the live server so it can be queried without disrupting transactional operations, and then you either write ETL code or use a service like Stitch or Panoply.

You can also use a cloud platform like Google BigQuery or AWS Redshift to dump raw data in and then create views and table extracts for all the commonly used business functions. That’s still overkill though and a simple RDBMS should suffice.

And if you want to raise awareness, see this article by Stitch Fix and the HN comments: https://news.ycombinator.com/item?id=11312243

> Very typical of engineers who just want to pad their resume at the expense of unsuspecting business people.

Or they were given the same PR crap you always get from sales people that they’re just days away from tripling the number of clients and by next year they should be 10-20x the number, so they went ahead and “built it right” so they wouldn’t run into the inevitable scaling issues they were supposedly assured to hit in short order?

A simple architecture should be able to carry this to 10x and even to 100x if you really want to push it.

And I’m not really saying otherwise, though I would somewhat disagree. I’m just saying that they weren’t necessarily (or even likely) thieving contractors who were just looking out for themselves. They built a respectable, usable, system.

Honestly the contractors I see in IT are usually the far opposite end: it works well enough that they’re happy and pay my bill and by the time it doesn’t work anymore I’ll be off to another gig, so who cares?

In many cases, the cause of the problem may not be contractors.

There are lots of clients that clearly set their expectations for contractors, whom they see as an expensive necessary evil: they want you to deliver fast and now, and they do not want to hear that it will take longer to deliver a robust system.

In this case no one technically competent was here to manage them. There were no expectations.

Spot on!!

Everyone has cargo-culted distributed file databases, and they’re good in specific use cases — if you have a large volume of data with a very high number of writes. Hardware and RDBMS performance have improved over the years to the point where if you’re not Google (or certain scientific applications), you probably don’t need much more than postgres. It’s completely within the bounds of feasibility of modern systems to store a 100gb database and its indexes entirely in memory. The only reason you need to scale beyond a single server in most business contexts is when you’re topping out IOPS.

If you just have a lot of data and are doing mostly reads, an RDBMS will almost always be faster for that reason. It’s also FAR easier / faster to write complex queries for an RDBMS.

Oh, and even Google has gone back to a more relational design with Spanner again.

That misses the point, doesn't it. The point isn't "maybe you don't really need nosql/non-relational", it's "maybe you don't need an expensive managed storage solution built for massive scale."

Spanner was indeed built for massive scale, which is reflected in the price.

Hmm, you are probably right.

Spanner does have somewhat less scale than their NoSQL offerings, and even Google says internally to go for the somewhat less scale-y Spanner over them. (Because it's easier to react to a need for scale later than it is to live without transactions and relational querying.)

Yeah... you can go to any DBA/database developer and say "I have a 100GB dataset that might grow to 1TB within 10 years" and they will just pick the RDBMS they are familiar with and you are 90% of the way there.

I work on an ELT process for something that's doing that about now on SQL Server, and not much query tuning is needed tbqh.

I really disagree that this is over-engineered. This sounds like the problem is under-engineering. You suggest setting up proper infrastructure, rather than what they have now, which sounds like a shared drive and various processes written in whatever the person knew to make something quickly.

It's currently one step up from people running notebooks locally and having no shared space for the data.

The term over-engineered has been sufficiently diluted to just mean "poorly constructed" at this point.

In my opinion, if your solution is currently not working well, then it cannot be over-engineered. Over-engineering leads to good solutions that are too expensive, not bad solutions.

I'm interested to hear other views on what over-engineering is, at the very least to get some form of enumeration.

Well, when several people are working on the same project they "share" the transformed data by connecting to the same EC2 instance. The way data is transformed is via 4 scripts, 2 notebooks and a bunch of manual operations, so no one really wants to touch that. I spent my 4th day working with a contractor to write a Makefile that reproduces all the steps for ONE project.

I talk about adding infrastructure in my original post, but I'm very well aware that my time is currently better spent consolidating the existing as much as I can so the clients can get correct results faster.

As a disclaimer, I work on the BigQuery team, but I wanted to point out that there is now support for transferring data from S3 to BigQuery: https://cloud.google.com/bigquery/docs/s3-transfer-intro

I did use BigQuery in the startup I was working for before, and it worked wonders for our 12TB of data. I think it would be a bit overkill in our situation, even though not having to manage a DB is great.

That’s the beauty of BQ - it scales well, but it works just fine in smaller use cases. It doesn’t get simpler than SQL.

Another item to consider is that BQ now has ML (simpler) models built in, further reducing the complexity of your pipeline: https://cloud.google.com/bigquery/docs/bigqueryml-intro

If you are not on GCP, then I’d consider AWS Athena for querying the parquet files, but you still have to structure these efficiently beforehand.

I will consider that. How about Redshift?

We had Redshift for our 23TB+ dataset and it worked great. The downside is it can get pricy, so do a cost analysis before you commit. Also know that views in redshift are not materialized so it’s more efficient to create physical tables of the views - which then adds maintenance overhead. The last thing I’ll add is that you’ll need to experiment with compression settings for your data. For us, a combination of ZSTD and bytedict was all we needed
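The view versus physical table trade-off described above can be sketched in a few lines. This uses Python's built-in sqlite3 purely as a stand-in (it is not Redshift, and the sales table is hypothetical): a plain view is recomputed on every read, while a table built from it is cheap to read but goes stale until someone refreshes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in only; Redshift is not used here
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("eu", 10.0), ("us", 20.0)])

# A plain view stores only the query; it is re-evaluated on every read.
conn.execute(
    "CREATE VIEW sales_by_region_v AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)
# A physical table stores the result; reads are cheap but it can go stale.
conn.execute("CREATE TABLE sales_by_region_t AS SELECT * FROM sales_by_region_v")

# New data arrives after the physical table was built.
conn.execute("INSERT INTO sales VALUES ('eu', 5.0)")

view_rows = conn.execute(
    "SELECT * FROM sales_by_region_v ORDER BY region"
).fetchall()
table_rows = conn.execute(
    "SELECT * FROM sales_by_region_t ORDER BY region"
).fetchall()


def refresh(c: sqlite3.Connection) -> None:
    """The maintenance overhead: rebuild the physical table from the view."""
    c.execute("DROP TABLE sales_by_region_t")
    c.execute("CREATE TABLE sales_by_region_t AS SELECT * FROM sales_by_region_v")


print(view_rows, table_rows)
```

Until refresh is called, the physical table keeps the old total for 'eu', which is the extra maintenance that has to be scheduled somewhere.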

One thing I don't understand regarding resume padding like this (which I do think totally happens) is how do you justify it when someone asks questions about whether it was necessary? It could be very subtle too if they know their stuff and want to see if you know it.

It seems like this would come back to bite in any decent interview.

> how do you justify it when someone asks questions about whether it was necessary?

I know of a local company whose data solution consists of dumping into Segment > S3 files > Pentaho (IIRC) > RedShift, and then using two different BI solutions, depending on the analyst. It needs two full-time data engineers just to keep it alive.

Now the funny part: a dump of their production database is less than 2GB and that isn't going to change any time soon: they don't make that much data to begin with, and their business model doesn't scale.

The argument used for building this new infrastructure is that users used to query directly into the production database and that would allegedly slow down their web app. So they decided they should take an "industry standard" path of handling data. C-levels were too afraid to "just use SQL" and instead asked "what is Amazon doing?".

It is an absolute mess and cost the engineering team three months just to set up the application to generate the right events, but at least business people have access to data without having to stop an engineer in the hallway.

I don't think this will ever come back to bite anyone in an interview because the fact the dataset has less than 2GB will never come up: interviewers charitably assume that it wasn't overkill or that the person isn't padding the resume.

I frankly believe that a lot of places are like that. We criticize web developers all the time for over-engineering simple apps, but everyone is doing the same in other areas, we just can't see it like we do with web apps.

I think it is worth noting that although some will over-engineer to pad their resume, there are other valid reasons why this may have happened.

It is entirely possible that the folks hired to do the job were better specialized at creating large scale solutions. Client/supplier may have assumed that as a big corp, this segment would scale quickly and a smaller solution would have to be re-engineered at a higher cost later on.

Unfortunately there is insufficient information from stakeholders to make a clear argument.

I don't know your ratio of HDF5 to Parquet files, but remember that every GB of Parquet equates to about 10 GB of space needed in CSV or PostgreSQL's internal format. So your data set is probably closer to 1 TB than 100 GB.

Storing that data on S3 is probably 50% the price of storing it on EBS and you won't have the durability guarantees of S3 when you're using PostgreSQL on EBS volumes.

If you're both exploring data and building models then Spark is fine. Its APIs are no more complicated than anything else out there for these tasks.

Hive is doing nothing more than offering schema on read and shouldn't be something you're thinking much about.

PostgreSQL is row-oriented and won't be able to offer features like row-group statistics, which let queries get the minimum and maximum values for every 10-15K rows of data for the columns they're interested in. This gives queries a huge speed-up, because they can read just the statistics for the columns they're interested in instead of scanning the rows themselves.
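A toy illustration of how min/max statistics per row group let a query skip data. This is a pure-Python sketch of the idea, not Parquet's actual implementation or API; the RowGroup class and count_greater_than function are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of one column plus the min/max statistics kept alongside it."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def count_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[int, int]:
    """Count values > threshold; return (matches, number of groups scanned)."""
    matches = 0
    scanned = 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove nothing here can match, so skip the scan
        scanned += 1
        matches += sum(1 for v in group.values if v > threshold)
    return matches, scanned


groups = [RowGroup([1, 2, 3]), RowGroup([10, 11, 12]), RowGroup([4, 5, 6])]
result = count_greater_than(groups, 9)
print(result)
```

Here only one of the three groups is actually scanned; the other two are ruled out from their statistics alone.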

Remember that you can have a single engineer run a single query on Spark and distribute it across several servers. This allows you to scale CPU and memory bandwidth in a way you won't be able to with PostgreSQL.

It sounds like your data isn't well organised. If you moved it around and put some consistent naming conventions in place that could help. You could also look to build an atlas of the data for newcomers to get an overall picture of what data you're storing and where it lives.
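An atlas like that does not need tooling to get started. A minimal sketch, assuming a hand-maintained catalog checked into the team's repository; every name, path and description below is a hypothetical placeholder, not the real layout.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    location: str      # where the files live (hypothetical S3 prefixes below)
    file_format: str   # "parquet" or "hdf5"
    description: str


# Hypothetical entries; the real ones would be filled in by the team.
CATALOG: Dict[str, DatasetEntry] = {
    entry.name: entry
    for entry in [
        DatasetEntry(
            "transactions",
            "s3://example-bucket/raw/transactions/",
            "parquet",
            "One row per transaction",
        ),
        DatasetEntry(
            "products",
            "s3://example-bucket/raw/products/",
            "hdf5",
            "Product descriptions",
        ),
    ]
}


def find_datasets(keyword: str) -> List[str]:
    """Return the names of datasets whose name or description mentions keyword."""
    kw = keyword.lower()
    return sorted(
        entry.name
        for entry in CATALOG.values()
        if kw in entry.name.lower() or kw in entry.description.lower()
    )


print(find_datasets("product"))
print(CATALOG["transactions"].location)
```

Even something this small gives a newcomer one place to look before asking where the data is.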

None of that matters. It's a hundred gigs. You can store it in a textfile and read it in its entirety if you want. It fits in RAM.

It is perfectly reasonable to store this in a database. If and when you change your mind about the data format you can just scrap it and start over.

It's 100GB compressed. Parquet does a very good job of compressing most data so that's where the estimate of 10x (so 1TB) uncompressed was mentioned as a rule of thumb.

Parquet also supports much better access mechanisms, like being able to deserialize a single column without having to read in entire rows.

But like you mentioned, 1TB of data in a traditional database isn't that bad.
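The back-of-the-envelope estimate in this subthread can be written down explicitly. In this sketch the 10x Parquet expansion is the rule of thumb quoted above; the HDF5 expansion factor is unknown here and is assumed to be 1x purely as a placeholder, and the function name is made up.

```python
def estimated_uncompressed_gb(
    total_gb: float,
    parquet_fraction: float,
    parquet_expansion: float = 10.0,  # rule of thumb from the thread
    hdf5_expansion: float = 1.0,      # assumption: unknown, placeholder value
) -> float:
    """Estimate the space needed once the data is loaded uncompressed."""
    if not 0.0 <= parquet_fraction <= 1.0:
        raise ValueError("parquet_fraction must be between 0 and 1")
    parquet_gb = total_gb * parquet_fraction
    hdf5_gb = total_gb - parquet_gb
    return parquet_gb * parquet_expansion + hdf5_gb * hdf5_expansion


all_parquet = estimated_uncompressed_gb(100, 1.0)
half_parquet = estimated_uncompressed_gb(100, 0.5)
print(all_parquet, half_parquet)
```

With everything in Parquet the estimate is the 1 TB figure mentioned above; the more of the 100GB that is HDF5, the further the number drops under that placeholder assumption.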

... also remembering that a traditional DB will typically not store data raw. Row compression is normal and disk compression is normal. The typical column store advantage is block compression, predicate pushdown and column-order storage.

Regular databases such as SQL Server and Oracle have had columnar compression built in as an option along with the row stores for years now. I use it in SQL Server a lot and it works great.

You can run a SQL DB over a compressed filesystem, and some DBs allow you to compress tables too.

> like being able to deserialize a single column without having to read in entire rows.

and it reads filesystem's whole page anyway

Sorry for the late reply, but parquet is a columnar format so if it's big enough data, you should have multiple pages/blocks of data in a single column for a specific row group, and then be able to seek to the next row group and sequentially read the next set of blocks.

I’d contend this personally. You can employ disk, or row compression on PG if you want. Compressed disk will actually make your queries faster. You can use cstore for ORC based column storage with PG if you want.

Presumably the cost of a few TB on EBS is the least of your worries.

Finally, the time saving of full transactional support and constraints + sql to write etl in will drastically reduce the amount of work needed to write etl.

IMO, if RDBMS is an option for you, do it whilst your data is small enough.

> sql to write etl in will drastically reduce the amount of work needed to write etl.


My experience with writing ETL in SQL is that it is almost never quick, easy, correct, or easy to test, and it is almost always denormalized or unconstrained (dimensional keys which aren't 'real' foreign keys, just numbers, so you can parallelize the data inserts and updates without constraint errors).

So... your mileage may vary with that.

It's most certainly not true that writing any kind of ETL that uses SQL saves time in all cases.

Well SQL would present the ETL declaratively for one ... whether the output is denormalised or unconstrained has nothing to do with SQL.

In benchmarks I've seen CStore is about 50% slower than Parquet on Spark.

Where is the transactional requirement? This person is working with a copy of the real data.

ETLs only need to be written once and if he decided on a PSQL approach he'd be writing ETLs to send the data there too. He's probably going to find a number of consistency problems so trying to normalise all this data again will just result in more work that won't make his team of DS' more productive.

If he's at ~1 TB of data today, where will he be in a few years time? What's the point of putting infrastructure in place that won't last for the next 10+ years?

The RDBMS advantage is that you can update your records and you can append to them without having to rewrite the dataset. That makes ETL much easier, e.g. when you need to recalculate a column. It’s also that referential constraints can make sure your database is coherent for you. This saves a lot of time and a lot of mistakes. You also get well-thought-through schema management and other benefits besides. Pg11 will scale happily to 10x his requirement. I don’t see why you’d want to build infrastructure for the next 10 years on Spark... since Spark is unlikely to be the thing by then anyway.
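A small sketch of the two points above: a referential constraint rejecting an incoherent row, and a column being recalculated in place without rewriting the dataset. It uses Python's built-in sqlite3 as a stand-in for Postgres, with hypothetical clients and transactions tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Postgres
# SQLite only enforces foreign keys when asked; Postgres enforces them by default.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE transactions ("
    " id INTEGER PRIMARY KEY,"
    " client_id INTEGER NOT NULL REFERENCES clients(id),"
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO clients VALUES (1, 'acme')")
conn.execute("INSERT INTO transactions VALUES (1, 1, 10.0)")


def try_insert_orphan(c: sqlite3.Connection) -> bool:
    """Try to insert a transaction for a client that does not exist."""
    try:
        c.execute("INSERT INTO transactions VALUES (2, 42, 1.0)")
    except sqlite3.IntegrityError:
        return False  # the database refused the incoherent row
    return True


orphan_accepted = try_insert_orphan(conn)

# Recalculate a column in place; no need to rewrite the whole dataset.
conn.execute("UPDATE transactions SET amount = amount * 2 WHERE client_id = 1")
rows = conn.execute("SELECT id, client_id, amount FROM transactions").fetchall()
print(orphan_accepted, rows)
```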

I don’t know about cstore being slower at all at 100GB. Nor do I know that it matters for the use case. Spark runs like a dog on a single machine and requires far more resources to do so. PG also has options like pgstrom for GPU acceleration if speed is even a thing.

Also, ETL is rarely written once... it’s an ongoing body of work that changes as the data does.

Disclaimer: I’m a cofounder of Segment [1], we build a product to help with these problems.

Given what you’ve shared here, it sounds less like your problems are related to scaling for data volume, and more related to all of the complexity that comes with a data pipeline. Instead of adding a bunch of new components, it sounds like you need just a few.

My concrete advice:

- Standardize and document the collection point for your data. Create a tracking plan which documents how data is generated. Have an API or Libraries which enforce the schema you want. If the sources of data are inconsistent, it’s going to be hard to link them together over time.

- Load all of the raw (but formatted) data onto S3 into a consistent format. This can be your long term base to start building a pipeline. And the source for loading data into a warehouse.

- Load that data into BigQuery (or potentially Postgres) for interactive querying of the raw data. For your dataset, the cost will be totally insignificant and results should give your analysts a way to explore your data from the consistent base.

- Have a set of airflow jobs which take that raw data and create normalized views in your database. Internally we call these “Golden” reports, and they are a more approachable means of querying your data for the questions you might ask all the time. The key is that these are built off the same raw data as the interactive queries.
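As an illustration of the first bullet, here is a minimal sketch of a collection library that enforces a tracking plan. It is not Segment's API; the TRACKING_PLAN structure, the validate_event function and the order_completed event are all invented for the example.

```python
from typing import Any, Dict

# Hypothetical tracking plan: event name -> required properties and their types.
TRACKING_PLAN: Dict[str, Dict[str, type]] = {
    "order_completed": {"client_id": int, "amount": float},
}


def validate_event(name: str, properties: Dict[str, Any]) -> None:
    """Raise if the event does not match the tracking plan; return None if it does."""
    if name not in TRACKING_PLAN:
        raise ValueError(f"unknown event {name!r}")
    schema = TRACKING_PLAN[name]
    missing = sorted(set(schema) - set(properties))
    unexpected = sorted(set(properties) - set(schema))
    if missing or unexpected:
        raise ValueError(f"{name}: missing={missing} unexpected={unexpected}")
    for key, expected_type in schema.items():
        if not isinstance(properties[key], expected_type):
            raise TypeError(
                f"{name}.{key}: expected {expected_type.__name__}, "
                f"got {type(properties[key]).__name__}"
            )


# A well-formed event passes silently.
validate_event("order_completed", {"client_id": 7, "amount": 12.5})
```

Rejecting malformed events at the collection point is what keeps the raw data on S3 consistent enough to link together later.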

We use Segment to manage all of the top three bullets (collect consistently, load into S3, load into a warehouse). Then we use airflow to create the golden reports that analysts query via Mode and Tableau. As other commenters have mentioned, there are a number of tools to do this (Stitch, Glue, Dataflow), but the key is getting consistency and a shared understanding of how data flows through your system.

This is a pattern we’ve started to see hundreds of customers converge on: a single collection API that pipes to object storage that is loaded into a warehouse for interactive queries. Custom pipelines are built with spark and Hadoop on this dataset, coordinated with airflow.

[1]: https://segment.com

Really frustrating. I went through this process recently. The data was a couple orders of magnitude bigger and so I tend to agree that maybe just straight to Redshift / Bigquery would probably work best, but here were our steps:

1.) Ensure that ingestion / S3 jobs are stabilized (in our case, the legacy jobs were in Informatica, and maintenance took up all of the team's time). We moved to Luigi for this, but Airflow is great too.

2.) Get Presto schemas defined and make Presto the interface for querying / basic pipelining.

3.) Add Mode Analytics or another basic query UI on top for ad-hoc queries. This cleared a massive bottle-neck for our teams because Analysts and data scientists now have direct access to data w/o technical help.

4.) Build "gold" records, for specific sets/types that are valuable, and get them piped from S3 into Redshift/Bigquery (we built a streaming layer for this). This speeds up querying, makes governance easy, and is extremely reliable.

Honestly, the hardest part here was the change management among our legacy teams. That said, it's incredible how widely this has been embraced now that we have it up and working.

Mode missed the boat by not including ETL capabilities. I worked for a time on the tool that Mode was based on, and its ETL capability was the hidden hand that got Data Scientists to build and maintain the data pipeline.

Develop the proper schema and put the data in BigQuery. From there, you can use Google Data Studio or Tableau to explore. You can stream in data with Pub/Sub. You can use SQL on BigQuery and you’ll be able to query it all very fast, somewhat regardless of size. Then you won’t need to support the old infrastructure and can buy yourself time to figure out if you want to stay with that ecosystem.

Don't dump the data to Postgres.

Instead, define a data model and write an API that pulls data out of the respective places. The API will be the one place your applications get data from.

You don't have to build it all at once. Just code the parts you need as you develop new applications.
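One way to read this suggestion, as a rough sketch: a small data model plus a single access layer with one loader per dataset, added as applications need them. The DataAPI class, the Sale model and the loader below are hypothetical; a real loader would read the existing Parquet or HDF5 files instead of the inline sample.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sale:
    """Part of the data model: what applications see, whatever the storage."""
    client_id: int
    amount: float


class DataAPI:
    """The one place applications get data from."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], List[object]]] = {}

    def register(self, name: str, loader: Callable[[], List[object]]) -> None:
        self._loaders[name] = loader

    def get(self, name: str) -> List[object]:
        if name not in self._loaders:
            raise KeyError(f"no loader registered for dataset {name!r}")
        return self._loaders[name]()


def load_sales_from_legacy_dump() -> List[Sale]:
    # Hypothetical loader: stands in for reading the files where they live today.
    raw = [{"client": "1", "amt": "10.5"}, {"client": "2", "amt": "3.0"}]
    return [Sale(int(r["client"]), float(r["amt"])) for r in raw]


api = DataAPI()
api.register("sales", load_sales_from_legacy_dump)
sales = api.get("sales")
print(sales)
```

Only the loaders know where the data physically lives, so the storage can be cleaned up later without touching the applications.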

Yes the first step is to document where the data lives. Take the time to dig through the mess and document what you have now and where to get all the information.

You want Sales - logon here, select [this] from [that]

You want forecasts - email the angry VP and ask for the latest spreadsheet

Once you know where things are - then you can think about rebuilding as you need.

Totally agree, but as a data scientist who was in a similar position a few years ago, I'd say you need to make the case for hiring a data engineer. Companies that are new to data science think that data scientists are supposed to do all the data engineering work, but without a data engineer you won't be able to produce valuable insights for some time.

I cannot answer your question without a full understanding of the current usage of your data infrastructure.

A few pointers:

- Who are the users of the platform? If it is only used by data science team then you can rip apart the solution and work towards a more logical infrastructure where all you are doing is cleansing, normalizing and deriving features and these become your central feature repository which your team can pull and build models. You need governance so that the team is aligned on what features are present and how to add new features to the repository. At a scale of 10 people it is much easier to have this all centralized; if the team is scaling out then we will have to work out a decentralization strategy.

- If you have operational reports like business reporting & investor reporting running on this infra then I would recommend keeping analytics workload separate from operational workload. They have different needs and SLA's.

One thing which worries me is you are talking about denormalization as something you are planning to do, that should have been the starting point of any HDFS/SPARK/Parquet based solution.

I can suggest tools for explorations, data quality check etc. But that requires more understanding of what your current infrastructure is solving vs what it was intended to.

Thank you for taking the time to answer thoroughly !

> If it is only used by data science team then you can rip apart the solution and work towards a more logical infrastructure where all you are doing is cleansing, normalizing and deriving features and these become your central feature repository which your team can pull and build models.

It is only used by data scientists. What do you mean by a feature repository? How would you organize it so people can push new features? This sounds very interesting.

> you are talking about denormalization as something you are planning to do, that should have been the starting point of any HDFS/SPARK/Parquet based solution.

It is something that we have to do, but the tables have been dumped as-is in S3 and every project rebuilds the whole derived dataset regularly. Since these operations are very brittle (a lot of manual work and even transformations performed in notebooks), this is something people dread doing. I am trying to secure this at the moment by writing Makefiles that remove human intervention, but at the end of the day I would like to avoid people spending hours waiting for new data when they need it.

> I can suggest tools for explorations, data quality check etc.

I would appreciate it. Put simply, we get data about the evolution of the stock of clients, transactions with their clients, product descriptions, etc. that is dumped into S3 (I scheduled a chat with people upstream to see what happens). We have 3 or 4 projects for each client. What currently happens is that every team writes the same code to build features in their separate repositories, and this code is re-executed every time new data arrives (weekly). These features are then used in prediction models.

Besides the brittleness of the process, I found that people are reluctant to analyse the data because it takes an unreasonable amount of time.

For your problem, I would suggest you to take a look at streamsets. They have an ETL plus data drift system in place, which is really interesting.

Ref: https://streamsets.com/

>Besides the brittleness of the process, I found that people are reluctant to analyse the data because it takes an unreasonable amount of time.

Is this because of the bad queries or way the data is organized?

>It is only used by data scientists. What do you mean by a feature repository? How would you organize it so people can push new features? This sounds very interesting.

Can you take a look at Feast by Go-Jek: https://github.com/gojek/feast ? There are similar projects by different big players in the market; this should get you started on the idea I was talking about.

PS: Sorry, was traveling that is why there was a delay in answering your question.
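For readers wondering what a feature repository could look like at its very smallest, here is a sketch of a shared registry where each feature is defined once and every project computes it the same way. This is not Feast's API; the FEATURES registry, the feature decorator and the two example features are invented for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
FeatureFn = Callable[[List[Row]], float]

# Shared registry: one definition per feature, reused by every project.
FEATURES: Dict[str, FeatureFn] = {}


def feature(name: str) -> Callable[[FeatureFn], FeatureFn]:
    """Decorator that publishes a feature under a unique name."""
    def decorator(fn: FeatureFn) -> FeatureFn:
        if name in FEATURES:
            raise ValueError(f"feature {name!r} is already registered")
        FEATURES[name] = fn
        return fn
    return decorator


@feature("total_amount")
def total_amount(rows: List[Row]) -> float:
    return sum(r["amount"] for r in rows)


@feature("n_transactions")
def n_transactions(rows: List[Row]) -> float:
    return float(len(rows))


def compute_features(rows: List[Row], names: List[str]) -> Dict[str, float]:
    """Compute the requested features for one entity's rows."""
    return {name: FEATURES[name](rows) for name in names}


client_rows = [{"amount": 2.0}, {"amount": 3.0}]
print(compute_features(client_rows, ["total_amount", "n_transactions"]))
```

Pushing a new feature then means adding one decorated function in the shared repository, which is also where the governance mentioned earlier in this subthread would apply (review, naming, no duplicates).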

Sounds like you have both organisational and architecture problems here. A data team of ten and you're the only permanent employee? There's no data engineer or infrastructure team you can liaise with? I would discuss this with management, making the problem clear and that you're happy to help bridge the gap, but they need to hire someone with these skills.

For the software side Presto or AWS Athena will give you an SQL query layer over parquet files in S3, which should make life a lot easier. It will also hold the table_name -> s3_path relations. Lower effort than dumping to postgres, though at the current scale that is an option too.

Postgres is not an appropriate choice for a data warehouse, because it isn’t optimized for scans. You want either Snowflake or BigQuery. They’re easy to use and they’re so fast that you’ll be able to do most of your work using very simple, un-optimized SQL queries.

Using HDFS or Parquet-in-S3 raises the complexity of your infrastructure a LOT and I’m not hearing a good reason for it. I wrote a blog post about when you should adopt a data lake, it might help you evaluate this choice:


Like others have said, the technical infrastructure is usually a manifestation of the people processes of the corporation. I think it's valuable to kinda ignore the technical stuff initially, and instead first understand the requirements of your customers. It's totally possible that the current system, as weird as it is, might satisfy your customers' requirements best. Unlikely, but it's possible.

But given that they hired you, chances are they know the current system isn't great, and they didn't possess the domain knowledge to fix it. I'd guess you & your company are aligned high-level that change is needed. It's just a matter of making sure you can align your ideas with the short- and long-term goals of the company, usually with a convincing story explaining how your technical changes drive business value.

For example, you mention that data exploration is hard, and I'm inferring this is a problem because you have multiple consultants independently scouring your datasets. If so, you could communicate to your customers that you can reduce consultant onboarding time from 5 days to 2 days (made that up) if you invested in aggregated datasets or a centralized data warehouse. If you can translate this to a dollar figure (like consultant hourly rate), that's even better.

As for what part you tackle first, I'd suggest finding a problem that everyone knows about but that is straightforward for you to solve. The goal is to display immediate value and gain the trust of the people around you. You don't solve the systemic problem immediately, but the trust you gain is currency you use months from now to really invest in the system. Because truth is, higher-ups rarely value invisible things like data quality or maintainability; they respond very positively to shiny new graphs and numbers.

FWIW I don't know if it's just me, but I feel like the bulk of data science is the ugly pipeline and architectural decisions you're facing now. I read people doing interesting modeling & machine learning work, but I keep wondering how much work went into getting the data into a modeling-ready state. I haven't worked at a company where the % of data team effort going to pipelines is less than, say, 80%.

99% perspiration, 1% inspiration, and the rest is ML.

100GB is so small, it fits into RAM of a single server for very little money. So even if it's stored in some weird formats, you can read it and parse it into memory structures that are the most efficient.

Unfortunately you provide few details on the structure of data, so it's hard to advise anything particular.

Okay, I kind of want to question the "just throw it in Postgres," attitude I've seen throughout this thread. I've dealt with similar sizes (maybe 750GB, number of rows in the small billions) and as much as I adore Postgres it just wasn't a great solution.

-- Loading would take hours, even with dropping the indexes beforehand

-- Creating the indexes afterwards would take even longer

-- Query performance was very hard to reason about. Even simple queries required altering some spillage limits.

-- Analytical queries were basically out of the question, with something like "select sum(thing) from table" taking many minutes.

Even if I could live with the above as just an initial load, there would have been a lot of churn as new data arrived that probably would have been even worse.

> Everyone else is a consultant, and they don’t have any incentive to care about the long-term impact of architecture choices: their management only evaluates delivery (graphs, model metrics, etc.). How do I go about raising awareness?

It sounds like management treats data science as a service function as opposed to a strategic function. Do you get a sense of whether they have any incentive to grow the data science team properly as opposed to leveraging consulting resources? My gut feeling is that they are trending toward building out a data science team, given you're their first data science hire, but it also means that your results will largely dictate what the team will evolve into in the future.

To raise awareness, get a good understanding of the management team and their motivations. Frame the issue to something that they are most concerned with (likely business-related) and highlight the impact of not addressing the issue of poor architecture.

Present your findings, highlight the impact and go a step further to propose 3 options. Highlight why the option you're recommending is the best.

We have a similar issue.

Generally speaking, your work must fit within a value stream - that is, to support your job/function you must do something that's rewarding (someone's got to pay the bills). There are a lot of interesting agile principles at play here but ultimately they revolve around delivery which must occur regardless of how complete, incomplete, fast, slow and/or viable your data is. Delivery is valuable. Your tech leads are the ones most clued up on what must happen and which compromises are acceptable. Their role is to get some data - any data - out so that you can make a qualified success of your product. Cutting corners and managing/deferring difficult decisions is part of that job.

Over time the cadence of features and the complexities of your data will require pivots. Initially you might have gotten away with simple line-level data but with maturity and agile planning the next revenue producing feature might require aggregation and complex transformations. Once again your role is at odds with delivery - if there's a simple way to make the end result a qualified success in the shortest lead time possible, data compromises ought to prevail over your immediate happiness.

Now with that said you might be thinking "wtf this is nuts - it can't be right... I'll never succeed". That's actually quite possible. It's often a reason why most developers move on. When that happens the process becomes self-destructive - new devs start and become disillusioned and leave creating an air of futility; product knowledge is virtually non-existent...all past decisions were shit and should never have been done that way in the first place etc.

So how do you fix this? I'm going to avoid the technical side because I don't know your stack. The procedural/operational solution comes from understanding and managing your employer's approach to agile. The big word here is Trust. Is your team trusted to manage the work items that enter the sprint? Are you trusted to add work items to the backlog? As developers do you own the sprint? Have you created items that address your technical concerns with concrete examples (it might take you several sprints to fully flesh these out)? If you answer 'no' to any of these it's time to address them and start owning the process. If you've tried this or you've failed to explain from a delivery perspective why you have problems and what their risk is (use past examples rather than future commitments) and there's been no traction you should consider quitting.

Thank you for your answer, it was eye-opening. The story is that they opened that department a year ago and needed buy-in quickly, so they jump-started it by hiring consultants. Whether hiring consultants was a good idea or not, I understand how things ended up where they are. A mix of inexperience and huge pressure to deliver.

Now that we have buy in the pressure isn't so bad. I have a startup background and this is one of the reasons why I was hired. I am not too worried about having to convince my manager, she's great and is the one who started criticizing the legacy (she arrived a few months ago) and asked me to dive in and give my opinion.

She agreed to let me work 30% of my time on this, and the rest on delivering direct value to the clients. The related tickets will be part of the sprint. I will do as you say now, and create items that address the concerns.

Your manager sounds great.

Once you're able to outline the delivery/business/product case consider getting your QA function to champion (author?) the technical solution with you.

It's good to separate the two initially (a well documented problem leads to more options than a quick fix) but at some point you'll want to engineer buy-in. What you've described will almost certainly be a pain point for testers - this makes them great advocates for change and grants QA much needed ownership.

Feel free to dm me if you want to discuss this further.

I'll make the case for using SQL as a data pipelining language. Firstly, many SQL dialects are multi-platform and provide standardization for transformations. It's a declarative language: it defines what is computed, not how the computation happens.

Where SQL is terrible to write is when one must pivot data. Each column transformation is defined separately (case whens). When the cardinality of a pivoted vector is high, it results in quite a verbose declaration. This problem can be mitigated for example by generating SQL programmatically with templating languages such as Jinja2. Rendering is handled nicely on platforms such as Airflow when running the rendered SQL in cloud (for example on top of Redshift or Presto cluster, BigQuery).
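To make the "generate the pivot programmatically" idea concrete, here is a minimal sketch. It uses plain Python string formatting rather than Jinja2 and runs against an in-memory SQLite database as a stand-in for Redshift/Presto/BigQuery; the `sales` table, its columns and the `pivot_sql` helper are made up for illustration.

```python
import sqlite3


def pivot_sql(table, key, pivot_col, value_col, categories):
    """Build a pivot query with one SUM(CASE WHEN ...) column per category.

    Table and column names are assumed to be trusted identifiers; only the
    category values are escaped, as SQL string literals.
    """
    columns = []
    for cat in categories:
        literal = cat.replace("'", "''")  # escape single quotes in the literal
        alias = "".join(c if c.isalnum() else "_" for c in cat)
        columns.append(
            f"SUM(CASE WHEN {pivot_col} = '{literal}' "
            f"THEN {value_col} ELSE 0 END) AS {value_col}_{alias}"
        )
    select_list = ",\n  ".join([key] + columns)
    return f"SELECT\n  {select_list}\nFROM {table}\nGROUP BY {key}\nORDER BY {key}"


# Hypothetical data: one row per sale, to be pivoted into one column per channel.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, channel TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("alice", "web", 10.0), ("alice", "store", 5.0),
     ("alice", "web", 2.5), ("bob", "store", 7.0)],
)

# The pivoted columns come from the data, so the query grows with cardinality
# without anyone writing the CASE WHENs by hand.
categories = [r[0] for r in conn.execute(
    "SELECT DISTINCT channel FROM sales ORDER BY channel")]
query = pivot_sql("sales", "customer", "channel", "amount", categories)
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

With a templating engine the loop would live in the template instead of in Python, but the rendered SQL would look much the same.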

For writing complex transformations, UDFs and cascading subqueries are the way to go. Window functions are useful for scanning subsets of column values (useful for example in vector transformations [doing normalization, regularization etc.])

SQL is also a language with a gentle learning curve which makes it easy to learn for less software-engineering-minded people (BI people and analysts of different departments in a decentralized data science organization). It's established itself as a lingua franca for matrix transformations already for decades.

Data processing is usually done in batches of different intervals as in traditional data science nothing really needs real-time processing for single events. Then Spark shines. But I would rather make a tradeoff of using SQL and Spark side by side when handling real-time processing than losing benefits of using SQL that I listed above.

When data transformations – with some object ontology related to it other than "just maths" – are to be done real-time, then you better start thinking about building an application for that (using your favorite programming languages).

Even with Spark, around 70% of work is done in SparkSQL.

I love SQL. But it is hard to get other DS on board who think that 40 lines of Spark is better than a 10 line SQL query.

The only thing that worries me with SQL is when having to write UDFs for, say, computing a Z-score. But maybe it's just because I have never done it? Do you have any good resources about this?

Don’t worry, I’m having my battles convincing my clients (both business and DS/DEs) that this is a viable paradigm. Here’s a nice-looking z-value recipe by Silota that I just googled up: http://www.silota.com/docs/recipes/sql-z-score.html
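For what it's worth, a z-score usually needs no UDF at all, just aggregates. Below is a rough sketch of my own (not the linked recipe) run through Python's `sqlite3` so it is self-contained; the `measurements` table is made up. SQLite builds don't always ship `sqrt()`, so I register Python's, which incidentally shows what a scalar UDF looks like there. Postgres has `sqrt()` and `stddev_pop()` built in.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register sqrt() as a scalar function in case this SQLite build lacks it.
conn.create_function("sqrt", 1, math.sqrt)

# Hypothetical table with a single numeric column to standardise.
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, x REAL)")
conn.executemany(
    "INSERT INTO measurements (x) VALUES (?)",
    [(2.0,), (4.0,), (4.0,), (4.0,), (5.0,), (5.0,), (7.0,), (9.0,)],
)

ZSCORE_SQL = """
WITH stats AS (
    SELECT AVG(x) AS mean,
           sqrt(AVG(x * x) - AVG(x) * AVG(x)) AS stddev  -- population std dev
    FROM measurements
)
SELECT m.id, (m.x - s.mean) / s.stddev AS z
FROM measurements AS m CROSS JOIN stats AS s
ORDER BY m.id
"""

zscores = conn.execute(ZSCORE_SQL).fetchall()
for row_id, z in zscores:
    print(row_id, z)
```

On a warehouse with window functions the same thing can be written with `AVG(x) OVER ()`, which makes per-group z-scores a one-line change (`PARTITION BY`).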

Thanks! Do you have any tips on convincing people that SQL is a good paradigm?

I'd just go and write out the technical architecture, defining what are the inputs (the raw data) and what are the outputs (matrices for training, testing etc. etc.) on different intervals (usually, data scientists want the previous days' data processed into some format, A/B test results and such) and how are you going to instrument those transformations. It's not just SQL but the DB where that SQL would be run and orchestration (for example with Apache Airflow), and for concrete ETL tasks (nodes in a processing graph) using a combination of open-source modules (usually in Python) and Bash scripts.

It takes time to get experienced in explaining and mapping these things to the domain.

Here's one approach:

1. Document all datasets at their sources of record

2. Setup jobs to dump their data into an S3 bucket (daily, hourly, whatever makes sense). Use IAM to lock down access if necessary.

3. Setup AWS Athena to give you some analysis capability on those files, while you:

4. Setup jobs to [denormalise/cleanup and] load those files from S3 into postgres RDS/citus data/redshift (alternatively you could denormalise after the load ELT-style using materialised views or [dbt](https://github.com/fishtown-analytics/dbt))

You'll still need a tool to orchestrate it all. I'm excited about trying [pachyderm](https://www.pachyderm.io/) for my next project.

re: point #2 -- you can also make sure access is locked down via a bucket policy instead of, or in addition to, an IAM policy; it keeps things cleaner if/when multiple roles, profiles, etc. need to access it later on. I assume OP meant that but just wanted to give my 0.02 in case it helps.

Forgetting the specific problem space and technology for a minute...

You have a team of 10, only one an employee, and 0 being dedicated to infrastructure.

This screams for a managed/hosted solution. I'd select one that is least disruptive to the team, versus what's technically best.

If you have data in multiple formats in s3 you could look at Snowflake (https://www.snowflake.com/). They have built in ingestion functionality and can translate various formats into their warehouse with minimal effort. You could also look at amazon redshift spectrum which will let you run queries across data in s3.

As someone else suggested, you should provide an API access layer so that users only have one way to request the data.

I think this problem is essentially the age old problem of refactor or add features.

Its manifestation in the data world, from an awareness perspective might benefit from similar solutions. I imagine there is plenty of advice about that. Personally I think refactoring is not a management concern - why should they care, and why should you expect them to understand? It's a technical problem and so I think you need to convince your fellow techies of the benefits to them. What might they be?

If you don't have a data engineering function, then one way to promote a better organisation is to make each data scientist an expert in one subset of the data (if that is possible). Then it becomes their responsibility to service any data requests, and they will automate and refactor when that responsibility becomes onerous, and not before, which should be a good way to self regulate your resources.

I have used standards and conventions within data platforms that I have written, but I feel that they are not as important as discovering what you need for your particular job and situation. It seems you have a handle on the main factors to consider. Off the top of my head, the various tradeoffs are to do with computation time, storage space, code complexity, infra costs, costs of refactoring, future proofing etc. And if in doubt, use the lean and agile approach!

Currently I am in the same boat. Few things that we have tried/figured out are -

(a) Deprecation - Get your butcher hat on. Start looking at existing things and see how many of them are used, and by whom. Start deprecating (or at least archiving) the offerings that nobody uses (or for which you cannot find a user).

(b) Simplification - Try to find the infrastructure components (compute engines, storage frameworks) that serve the same use-case, and see if you can converge into one. For example you can converge from HDF5 and S3 to just S3. Similarly from Hive and Spark to just Spark. Don't bring another infrastructure component in the mix, otherwise someone new in your place will make another HN post in future :)

(c) Documentation - Start building a place to document all the offerings that you have. Some wiki-style solution or, if it works for you, something as simple as Google Docs. Or it could be some solution like Superset/Redash that at least brings everything into one place.

(d) Governance - Get some power users in the system and take their help in (i) identifying important datasets, (ii) adding information about existing datasets, (iii) reviewing new code/datasets/production deployments.

(e) Start checking transform code, table DDL and all metadata into a git repository. This will automatically build up some documentation over time and help weed out duplicate logic.

It sounds like your org needs two things,

1. A data warehouse for this data

2. Awareness of software/data best practices

That being said, while I agree code duplication is bad, data duplication isn't, as long as you are maintaining data lineage. In some cases data duplication is good.

I also wouldn't care too much that you have 100Gb max in a big data architecture. So what? It's not like you're going to be able to get rid of it easily. A data warehouse built from a new set of pipelines seems like the biggest bang for your buck.

The solution depends a lot on the problem at hand. Many people focus on the data size saying it is "trivial", but depending on what kind of data and the format, it may not be trivial. I have e.g. inherited a project where we have the same order of magnitude of data spread across 100 million files, and nothing was trivial about managing that.

Before recommending any solution, you need to ask yourselves the following:

1. Data ownership: do you own the data, or do other departments rely on it ? Or worse, are the data customer data for which you need to guarantee some kind of clear audit trail and access control ?

2. What are the data ? Structured, semi-structured (Log data ), unstructured (images, sound, etc.).

3. What is the data for ? Analysis, training some models, viz ?

4. Can you put the data in the cloud, or do you need to store it on prem ? Questions to consider: data ownership, regulatory constraints, budget for cloud, IT quality in your company, etc.

5. Are the data write once, or are they often modified ?

Generally, I would try to create a single source of truth, but the difficulty would depend a lot on the answers of the above question. If the data are not often modified, then it is much easier to do it: you keep a single source of truth as whatever format is currently used, and you create a pipeline to create derived data (e.g. parquet/hdf5) as simple as possible first. You make sure that the derived data are RO if you can (technically and "politically").

This way you decouple the SST from the format used downstream, at which point you have much more latitude to improve things. If the SST format sucks, you can change it w/o impacting downstream users. You can also "export" the data into different format for different usages, including a DB which is indeed nice to build app/dashboards/etc. on top of. I avoid distributed platforms like the plague, especially if it is managed by the data team.

The difficulty of that decoupling phase depends a lot on the questions above. If you can use the cloud, you don't need an IT team, and backups are much easier to manage, as long as you have the budget for it (and the budget will be small for that amount of data). Another difficulty is data consistency: you can often decouple w/o complete consistency (e.g. format consistency is enough, values consistency is not strictly required).

The choice of technologies is in my experience completely secondary to the problems above.

Hiring a data engineer would be your best option. The second best option would be to outsource the infra to a company like datacoral.com or pachyderm.io.

Be aware that a single database or tool is probably not a solution. You may get more rope to hang yourself with.

Any solution is going to be a function of data engineering, systems design and project management. If you lack one of those abilities then you'll need more of the other two to make up for it.

(Feel free to reach out to me on https://www.linkedin.com/in/iblaine/ if you like. I'm not selling anything...I have been in the DE industry for a while...)

Hey Remilouf,

I know exactly how you feel. I was put in the same position and ended up spending 2 years trying to clean up the data and set up the warehouse and handle all the requests for exploring the data.

I had a background in AI and was fully blocked on doing anything I wanted.

4 years later I built a company to make this easy for people.

I would do the following: 1) Setup a Redshift Instance 2) Use Fivetran to Dump all the data into that Redshift cluster. 3) Leverage my startup Narrator.ai to model and use that data.

Now your team can use your modeled data and you have a clean time-series data structure to do the DS algorithms you want.

Reach out and we can talk about this in detail: ahmed@narrator.ai.

Firstly, that does sound fairly insane. Barring any unusual computational requirements, your data is several orders of magnitude smaller than the 'big data' stack it seems like you're using.

If you can quickly set up a simple version to test with any of the rest of the team that are willing, that'd be my first port of call. A simple ETL job to pull in fresh data from your S3 store and push it into Postgres. Then you can hopefully get everyone else on board with how much easier this setup will make their lives, and build a consensus that this migration would be a good idea.

A quick idea re. Postgres and 'quick enough' vs. 'fast enough' - what are your requirements around data freshness for the models? Can you take data from a read replica or even your DB backups (extra points for letting you test your backups) for your model training workloads to keep load off the main instance?

W.r.t. the 'It takes several days to explain where all the data is' - in the first instance, I'd draw up a shared spreadsheet listing all this stuff. Yes it's best case going to be eventually consistent with the actual data you hold, but it should cut down the time spent regurgitating the same information over and over and gives everyone a central point to store it. This will be much easier and faster to implement than a full blown data catalog, which you could look at doing once you've got the 'duct tape' solution going.

If you want to chat, email is in profile.

I would pick Postgres. Tables are the best format for data storage, SQL is the best language for exploring them, and Postgres is the best SQL database. With its support of myriad constraints, aggregate functions, and data types, including JSON, you should be able to build whatever view into the data that the different people want, with either database views or functions. You might have to copy data into different databases (or just different schemas) for different teams, but maybe not.

Datasets of a certain size might make Postgres fall over, but I don't think you're anywhere near that. That doesn't mean I would always stuff all my data into Postgres raw. I don't put Word documents or images in Postgres. I leave my Apache log files as Apache log files. If I want to analyze them, I usually use Bash (sed, cut, sort, etc.). But if I want to get really fancy I will import them into Postgres to run SQL queries on them, but usually just a certain segment of them (3 months, a year, etc.), only certain rows (grep first) and only certain columns (awk first).
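As a rough illustration of the "grep first, awk first, then import" workflow, here is the same idea sketched in Python, with an in-memory SQLite table standing in for Postgres. The log lines, the `/api/` filter and the `requests` table are invented for the example, and the regex only covers the basic Apache common log layout.

```python
import re
import sqlite3

# Matches the leading fields of an Apache common-log style line.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def select_rows(lines, path_prefix):
    """Keep only matching rows (the grep step) and only three columns (the awk step)."""
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path").startswith(path_prefix):
            yield m.group("ts"), m.group("path"), int(m.group("status"))


# Hypothetical log excerpt; in practice this would be an open file handle.
sample = [
    '10.0.0.1 - - [20/Apr/2019:10:00:00 +0000] "GET /api/users HTTP/1.1" 200 512',
    '10.0.0.2 - - [20/Apr/2019:10:00:01 +0000] "GET /static/app.css HTTP/1.1" 200 2048',
    '10.0.0.1 - - [20/Apr/2019:10:00:02 +0000] "POST /api/orders HTTP/1.1" 500 128',
    'not a log line',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (ts TEXT, path TEXT, status INTEGER)")
conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", select_rows(sample, "/api/"))

# Only the filtered segment is in the database, so the SQL stays cheap.
loaded = conn.execute("SELECT COUNT(*) FROM requests").fetchone()[0]
errors_by_path = conn.execute(
    "SELECT path, COUNT(*) FROM requests WHERE status >= 500 GROUP BY path"
).fetchall()
print(loaded, errors_by_path)
```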

If your data infrastructure is staid and boring, I think it's a good sign that you're doing it right.

You don't have a tech problem (yet). What you have is more a people problem. Overly complicated architecture, inconsistent ETL pipeline and lack of service discovery are all due to a lack of "mentality". Together with relying on consultants in the first place, I conjecture whoever is in charge of your team is more a short term business metric driven type (cost, return, ROI, bonus, KPI, etc).

That's where the challenge is.

1) Look at 3rd party solutions (Stitch, Fivetran, Segment, etc.) to ingest your raw data into a data warehouse (strong preference for Snowflake).

2) Use dbt (https://www.getdbt.com) to clean, transform and model the raw data into analytic tables.

3) Add a BI layer (Mode, Looker, etc.) on top of the analytic tables for reporting.

My experience with dealing with lots of legacy apps and infrastructure is to take a step-by-step, pragmatic approach.

Identify all the pain points of the current solution, tools and processes. For all those pain points take time to discuss with all the people involved to have a good understanding of the issue. Something you see as really bad might not be as problematic as you thought, some other things you might not have seen might be worthwhile to look at. Don't do anything before having involved the impacted people otherwise, however good your idea is, you might face rejection.

Be also pragmatic. The end result needs to be more efficiency and, hopefully, people who are happier doing their job. If the main issue is the learning curve, maybe before fixing the actual model, look at whether the current one requires better documentation.

If it is code duplication, look if there's way to share code and knowledge without the need to change the whole underlying data model.

By doing so you can gain trust with few quick but efficient and time saving changes. This will prove that you know what you're doing and that you're not asking for investment just for the sake of technical beauty. Then you'll be able to talk about bigger changes

Usually I also found that not jumping directly into the big technical overhaul and focusing on small changes ends up showing issues that were hidden in the first place or help to foresee a better solution that works for everybody. For example, by sharing code between data scientists you might end up seeing some technical requirements that you wouldn't have thought about. By writing documentation, you might end up seeing some constraints or some way to improve the way data is stored that you wouldn't have thought about.

Don't jump on a technical choice right now, make sure you have a good vision on where to go first and a good understanding on how it will fit in the company processes.

> Am I the only person here who finds it completely insane?

You are sane, the situation is batshit.

Your priority is to get your data into one place, and one format. But that will take time and money, so you need to make a business case for change.

The first point you need to hammer home is that it's expensive to explore your data, because it's all over the place.

Second you need to come up with a rough number of hours that are wasted a month by each member of staff (and consultant). The fact that you have consultants means that your company is burning money, so finding allies among your managers is a good shout. They will be constantly asked to justify the cost.

Third, and bonus points for this, come up with a use case for a current project that can't be done until the data is in the same place, format and is sane and normalised.

Don't worry about pipelines, that's a technical issue; what you are facing is a cultural problem.

Moving from your existing infrastructure to a new infrastructure incurs a cost in both design and implementation. There is also a cost in training existing people on the new infra.

Without much info to go on as to how the infrastructure is complex, the first thing I would suggest is to get rid of operations overhead by going with a hosted service. AWS EMR, Databricks, Qubole, etc. offer a service with S3+Hive+Spark.

The reason for using the tools might not just be volume. It could also be due to variety (a large number of different data sources using spark for pre-processing), or scaling, or something else. Try to understand why the existing solution is used before planning a migration.

If you are determined to go with a new pipeline, build a prototype for a small subset of data science tasks you have and carefully evaluate the pro and cons vs the existing approach for that subset.

The easiest way to raise "awareness" is making it difficult to do the wrong thing. Or making it very easy to do the right thing. Again, this depends on your environment. Techniques you could use are easy default classifications for the data, ease of browsing existing code so that it can be reused, etc.

Sounds like you have a mix of on-prem & AWS? If it is all AWS, check out EMR: https://aws.amazon.com/emr/. Alternatively, a relatively easy way to build pipelines on AWS is using AWS Kinesis for event streaming to S3 and ingesting the S3 files into Snowflake https://www.snowflake.com/ - works relatively well for smaller workloads & easy to set up.

IMO - the best way to raise awareness is to build a scrappy prototype pipeline that can be demoed & then demo/over communicate with all the stakeholders :). Having a working demo makes it easier to visualize the pros of the new proposed system compared to the existing one.

For exploring, I started Kyso for this reason, you have data in S3 - explore it using a Jupyter notebook running on ec2, and if you wish you could push your notebooks to Kyso for your team to read (we make them look like blog posts).

We act like knowledge repo for teams so you can add a wiki article explaining where the data is and how to get it, and the notebooks can be downloaded/forked, saved on github, so that your team can re-use them as data transformation scripts.


I'm more than happy to help anyone get started at eoin [at] (company website)

"I was just hired as the first permanent data scientist..."

You hope.

"I was thinking about building a [another] pipeline... "

You're fired.

Shouldn't you first deliver some new insight based on your data analysis skills?

I was also hired because I also have an affinity (affinity, not expertise) with data engineering and am familiar with development good practices. The idea is not to spend 100% of my time doing this, more like 30%.

Business value comes first, and with a better infrastructure we could deliver a lot more value with the same head count.

The infrastructure you described is a fairly common pattern in data-based startups. As mentioned, introducing a new solution has its ups and downs, so it's best to weigh your options, evaluate, then implement. Since you're the first person you'll have to lay the groundwork. Just remember to take a strategic perspective, as the 2nd person on the data science team may think that your design is insane a few years down the line.

We built a data warehousing pipeline which handles 40 million events per day and we use BigQuery. We don't use any other tools like Kafka or Apache Beam. But we use redis very extensively. It's a simple system but handles everything. You don't always require a complex solution. But yes you can minimize the complexity with tools like Kafka.

I'd just write a bunch of adapters to transform all your current data into a unified set of formats in avro or parquet or whatever, compress the originals and dump them into glacier just in case, and then use Athena to query it.

Then you can have s3 file write triggers to ensure that every new file conforms to your new schemas
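A minimal sketch of what such a trigger could check, assuming it receives a standard S3 event notification and that the records have already been decoded into dicts. The schema, the column names and both helper functions are hypothetical; actually reading the object (e.g. with boto3 and a parquet reader) is left out.

```python
from urllib.parse import unquote_plus

# Hypothetical target schema: column name -> expected Python type.
EXPECTED_SCHEMA = {"customer_id": int, "event_time": str, "amount": float}


def schema_errors(record, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable problems; an empty list means the record conforms."""
    errors = []
    for column, expected_type in schema.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            errors.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )
    for column in record:
        if column not in schema:
            errors.append(f"unexpected column: {column}")
    return errors


def objects_in_event(event):
    """Yield (bucket, key) for each object named in an S3 event notification.

    Object keys arrive URL-encoded in the event, hence unquote_plus.
    """
    for rec in event.get("Records", []):
        s3 = rec["s3"]
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


# Example: one conforming record and one that should be rejected.
good = {"customer_id": 1, "event_time": "2019-04-20T00:00:00Z", "amount": 9.5}
bad = {"customer_id": "1", "amount": 9.5, "extra": 1}
print(schema_errors(good))
print(schema_errors(bad))
```

What to do with a non-conforming file (reject, quarantine, alert) is a policy choice; the check itself stays this small.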

You could use a data governance tool, eg. [0], to ingest data from several different sources. I think they cover most databases (ie. postgres, mysql, mssql), S3, Hive, Snowflake, just to name a few.

[0] https://www.collibra.com

The easiest thing would be EMR if you want to start moving to aws-provided stuff. You can also start doing ad-hoc queries for better explore-ability with Athena. You can do all of this without changing the data files maybe.

It sounds like you already have an idea of what you want to do, but I think you should pause and think more deeply about what you have, vs. what you want.

What I would want in your situation is:

    - All the data in one place.
    - An easy way to explore the data. 
    - A single source of truth for transformed data.
    - Metadata to explain the data model (ie. documentation).

What you're proposing does some of those things, but it also:

    - Adds yet another maintain-forever technology to your stack.
    - Adds yet another pipeline (or set of pipelines) that does the same thing.
    - Moves from an architecture that is clustered for scale (ie. spark) to one that only scales vertically (postgres). 
    - Potentially introduces *yet more* sources of truth for some data.

> I was thinking that in a first iteration, data scientists would explore their denormalized, aggregated data and create their own feature with code.

^ Moving data into postgres doesn't make this somehow trivial, it just enables people to use a different SQL dialect. The spark API is, for anyone competent to be writing code, not meaningfully less complicated than using the postgres API.

I appreciate the naive attractiveness of having a traditional "data warehouse" in a SQL database, but there is actually a reason why people are moving away from that model:

    - it doesn't scale
    - SQL is a terrible language to write transformations in (it's a *query* language, not an ETL pipeline)
    - it's only vaguely better when you have many denormalised tables, vs. s3 parquet blobs
    - you have to invent data for schema changes (ie. new table schema, old data in the table) (ie. migrations are hard)

More tangibly, I know people who have done exactly what you're talking about, and regretted it. Unless you can very clearly demonstrate that what you're making is meaningfully better, it won't be adopted by the other team members and you'll have to either live forever in your silo, or eventually abandon it and go back to the old system. :/

So... I don't recommend it.

The points you're making are all valid, and for a small scale like this, if you were doing it from scratch it would be a pretty compelling option... but migrating entirely will be prohibitively expensive, and migrating partially will be a disaster.

Could you perhaps find better way to orchestrate your spark tasks, eg. with airflow or ADF or AWS Glue or whatever?

Personally I think that databricks offers a very attractive way to allow data exploration without a significant architecture change.

The architecture you're using isn't fundamentally bad, it just needs strong across the board data management... but that's something very difficult to drive from the bottom up.

You changed my perspective a little bit by asking the right questions.

> Moves from an architecture that is clustered for scale (ie. spark) to one that only scales vertically

I did a quick estimate of the volume, and we won't reach 1Tb for at least 5 years. We're not in a line of business where the number of clients can increase dramatically so it's fairly predictable. I don't want to design for imaginary scaling issues.

> Potentially introduces yet more sources of truth for some data.

It is more intended to replace the current mess.

> SQL is a terrible language to write transformations in (it's a query language, not an ETL pipeline)

Actually this is the point that concerns me the most. The need to transform the data in non-trivial ways. But surely people didn't wait for Spark to do this?

> Unless you can very clearly demonstrate that what you're making is meaningfully better

This is a very good point, and I think I should come up with a quick POC to demonstrate and get buy-in.

> Could you perhaps find better way to orchestrate your spark tasks, eg. with airflow or ADF or AWS Glue or whatever?

I feel that it would just be solving the mess by adding more mess.

I disagree with the author of the parent comment about avoiding SQL and using Spark instead. I actually first wrote my "SQL advocation" as a reply to this comment but decided to leave this view for what it is and write my own "rant" against complicating "big" data transformations with Spark or EMR (Hadoop Pig) or vendor-locked Spark-instrumentations like AWS Glue.

But I agreed with the parent comment's author about pretty much everything up to the third bullet point of the second list. I'd like to get more reasoning behind his SQL hate.

Just a spectator to this conversation, ordinary web developer over here, would anyone care to explain to me what form a “pipeline” takes? Is it a server endpoint? I really have no idea.

Data pipelines are typically used to translate data from whatever format the system that produces it speaks into a format that's useful for querying.

As an example you may want to take server request logs and write them to a Postgres table for querying, in which case you'd have something like this:

    Server Logs -> S3 -> Lambda which reads new logs to extract key fields -> Postgres

Once that's done you end up with a database table containing rows for things like URL, source IP, response time. You'd probably also normalise URLs so that /products/123, /products/123/ and /products?id=123 come out as the same thing for analysis.
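To make that concrete, here is a toy version of the "extract key fields" step with the URL normalisation, as a sketch only: SQLite stands in for Postgres, the records are made up, and the rule "an `id` query parameter is equivalent to a trailing path segment" is an assumption about this hypothetical site rather than a general truth.

```python
import sqlite3
from urllib.parse import parse_qs, urlsplit


def normalise_url(url):
    """Map /products/123, /products/123/ and /products?id=123 to one canonical form."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")          # drop any trailing slash
    ids = parse_qs(parts.query).get("id")  # assumed convention: ?id=N means /N
    if ids:
        path = f"{path}/{ids[0]}"
    return path or "/"


def load(records, conn):
    """The 'Lambda' step: keep the key fields and write them to the table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests "
        "(url TEXT, source_ip TEXT, response_ms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO requests VALUES (?, ?, ?)",
        [(normalise_url(r["url"]), r["ip"], r["ms"]) for r in records],
    )


# Hypothetical parsed log records.
raw = [
    {"url": "/products/123", "ip": "10.0.0.1", "ms": 40},
    {"url": "/products/123/", "ip": "10.0.0.2", "ms": 60},
    {"url": "/products?id=123", "ip": "10.0.0.3", "ms": 80},
]

conn = sqlite3.connect(":memory:")
load(raw, conn)
summary = conn.execute(
    "SELECT url, COUNT(*), AVG(response_ms) FROM requests GROUP BY url"
).fetchall()
print(summary)
```

So a "pipeline" is not a server endpoint; it is just code like this, run on a schedule or on a trigger, that moves data from one shape and place to another.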

MemSQL could be a good fit if you need a SQL-based data warehousing tool. They have pretty good integration with Kafka and BI tools.

100GB actually sounds like the job for command line tools on a single computer

For some use cases I would totally agree :)

Postgres should scale nicely beyond that. Stick Tableau and Alteryx on top and you’re done.

For 100 GB that might grow to 1 TB:

2x NVMe drives, each 1 TB in size. Buy PCIe adapter as needed.

Buy used 256 GB RAM server off eBay.

Install drives into server. Put all data on the NVMe drives and all other files on regular drives.

Everything will fit in RAM after the data is touched the first time after each boot. Can't get faster than that, usually.

Create user accounts for each user, set up git etc. for code storage and to encourage reuse.

I think it may come across as trolling :)

Successful presentation of this idea is left as an exercise for the reader :)

> I was just hired

> I was thinking about building a pipeline

> How do I go about raising awareness?

You don't get it yet.

Everything there is there for a reason.

Some of it may be technical reasons, some of it may be delivery reasons, some of it may be people reasons, team reasons, political reasons, etc.

You just landed a sweet perm job in a field you love. Don't ruin it by becoming that cliched new hire that sees all their problems and knows how to make it all better. You may be right. Technically. They may even encourage you.

But you could also be wrong technically. You could step on toes politically and end up sidelined. You could end up biting off more than you can chew and end up becoming responsible for the bigger mess later.

My advice is to stop, take a deep breath, look around, appreciate what you've already achieved by getting there, get to know your colleagues, get to know the company, get to really understand the system so when the next new hire comes in you can explain the reason behind everything (maybe getting to know the detailed history of the system and why it formed the way it did?), and make sure you know absolutely everything you can before changing everything you can.

Yeah, maybe you'll feel some parts of your job suck for a while (I wish this was easier, it's stupid that I have to do all this work to get from A to B when I can see a better way), but if you give yourself more time to learn, you're really doing yourself a favor in the long run.

And you're a perm now, in a big corp, doing data science. Relax, you got it made, right?

So chill out and take a deep breath and enjoy your new workplace and everything about it (not just the stack in front of you), and if you still want to make changes somewhere down the line, start small, bite off a tiny little piece you can chew, and succeed with that small improvement before moving forward with anything more.

Also, think of it differently: if you do end up being the one responsible for reinventing the whole stack, then milk that project for everything you can. It's a big corp play so you have to do that in a big corp way. There'll be meetings, committees, decisions, stakeholders, teams formed, responsibilities. You could even parlay this project into some greater responsibility and title for yourself, maybe even use it to boost your career. So think of it not like you are trying to understand a technical problem, but like you are trying to understand a piece of (and through it, the entirety of) your whole new organization, with all that entails: the people, the team relationships, how decisions get made, etc. So enjoy playing that game, because you are happy to be in a big corp, so the sort of benefits that can bring you is what you want, right?

It's not a startup. And if you feel technically unsatisfied, use the time to learn some new languages or skills, or kick some side projects down the road for your own benefit.

> Don't ruin it by becoming that cliched new hire that sees all their problems and knows how to make it all better.

I don't, which is why I'm asking around. I'm also scheduling chats with people to understand the background and the history to see if it's worth changing anything. I will make a move if that makes sense. In the meantime, I'm just gathering information so I don't make a stupid decision.

> And you're a perm now, in a big corp, doing data science. Relax, you got it made, right?

> So enjoy playing that game, because you are happy to be in a big corp, so the sort of benefits that can bring you is what you want, right?

I don't think this was necessary. I've only worked for startups before, and I was hired in part to see if we can do a better job with the resources we have. Buy-in from management is not an issue. I am not asking for life advice.

you thought it wasn't necessary because you feel I was trying to be mean to you? I thought it was necessary to remind you. did you feel I wasn't being genuine? I don't think that was necessary to take it like that. but I can definitely understand how you might feel scared you're being judged for working at a big Corp after startups.

I'm not judging you, I think big Corp is a great achievement. I am genuinely congratulating, and reminding you that big Corp is what you wanted, so you can learn to play that game. is that better now?

buy in from management is always an issue. it's just they don't want to give you the impression of friction because you've just started. they're presenting you a side they think you'll like because management has decided it's important to hire and retain talent like you.

you're not asking for life advice? you're saying that because you feel I've been giving you life advice? I can understand you really take your work personally and I think it's a good thing to be passionate about what you do. I've only given you work advice specific to the situation you describe. and I'm happy with what I've said.

but that's enough about you, I took the time to read your post and make an answer, and when you make these comments, I feel like you're making it all about you, I feel you are attacking me for that and like you're not showing any gratitude. that hurts because I just wanted to be seen for contributing my help and perspective. when you ask for help, can you consider not only your own feelings but also the feelings of those offering you help? thank you.

finally, when you ask this in a public place, the answer is not just for your benefit.

hope your new job is good.

You have 100 GB, so the last thing you need is anything more than the absolute minimum in infrastructure overhead. My advice is plain and simple: use Vertica Community Edition. Vertica is, in my opinion, the best possible technology in these scenarios. Vertica Community Edition is free for up to 3 nodes, and up to 1 TB of data. It is the fastest columnar datastore I have used, and once you learn some of the tips and tricks it just works.

I make no money from Vertica, I am not in any way shape or form compensated by them. Dump your data into Vertica, stand up a few servers, and forget about querying infrastructure until your data grows 10x.


I do however own some stock in Domino Data Lab. The actual challenges of coordinating a data science team are tough. Making sure there is a single project repository, one place to manage history, etc. I would consider looking into Domino Data Lab. They have a ton of experience helping teams like yours leverage a data science platform. It's good tech.


TL;DR - Dump your data in Vertica. Find a DS platform that helps you collaborate; Domino Data Lab can be that platform.

Have you considered putting your information into Elasticsearch?

Logstash would allow you to build out pipelines centrally (or via CM) to manage your data with much greater granularity.

100gb would fit on 3 pretty small instances and Kibana would let you sift through that information very quickly.

Disclaimer: I work for Elastic. (Feel free to reach out to me if I can help though!)

This is a nightmarish recommendation. I totally get Elastic trying to branch out, but there are hard (even mathematical) limitations on what a single DB can and should do. The only DB that's presented me with a semi-coherent theory about how it should be used as a Swiss Army knife is Datomic, and I would be very skeptical of it at large scales. Also, its query language is based on Datalog and it's inspired by algebraic DB concepts, so it's just way outside the mainstream. Elasticsearch is really convenient for faceted search, so I think it should just stick to that.
