
Ask HN: What technology is involved in your data Warehouse/ETL process? - pvorb
I'm confronted with the task of building a data warehouse for my company. Because the DWH landscape is quite overwhelming, I'd like to hear what technology you are using and whether you're happy with it.

PS: We're on AWS, but I'd also like to hear stories about other environments.
======
mattbillenstein
We run most of our stuff on AWS, but we use BigQuery for the data warehouse.
It's a nice product: no cluster to manage, pretty good APIs for getting data
in and out of it, a decent in-browser SQL editor for casual users, and
connectivity to a bunch of other BI tools (Metabase is a good OSS dashboarding
and visualization tool we use).

We use a mix of hourly/daily .json.gz export/import jobs and streaming
inserts. This is all automated with Python scripts and Airflow.
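
For a sense of what one of those .json.gz jobs looks like, here's a minimal
sketch of a load job with the google-cloud-bigquery client -- the bucket,
dataset, and table names are made up for illustration, not ours:

    # Hypothetical hourly load of gzipped newline-delimited JSON into
    # BigQuery; BigQuery decompresses .json.gz transparently on load.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_uri(
        "gs://example-archive/events/2017/10/05/14/*.json.gz",  # made-up path
        "example_dataset.events",                               # made-up table
        job_config=job_config,
    )
    job.result()  # block until the load job finishes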

For streaming events, we post from the client to nginx; nginx+lua timestamps
each event and posts it to NSQ; an NSQ Python consumer flattens and streams
those events into BigQuery. nsq-to-file rotates line-delimited JSON hourly,
which is gzipped and uploaded to GCS for archival. The archive is necessary
for some types of schema changes, where you'll want to drop the old table
and import the old data into the new schema.
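
A rough sketch of that consumer shape, using pynsq and the BigQuery client --
the flattening logic, topic, channel, and table names below are placeholders,
not our actual code:

    # Hypothetical NSQ -> BigQuery streaming consumer; flatten() is a stub
    # for the real flattening step (collapsing arrays of sub-records).
    import json
    import nsq
    from google.cloud import bigquery

    bq = bigquery.Client()
    table = bq.get_table("example_dataset.events")  # made-up table

    def flatten(event):
        return event  # placeholder for the actual flattening logic

    def handler(message):
        row = flatten(json.loads(message.body))
        errors = bq.insert_rows_json(table, [row])  # streaming insert
        return not errors  # returning False requeues the message in NSQ

    nsq.Reader(
        message_handler=handler,
        nsqd_tcp_addresses=["127.0.0.1:4150"],
        topic="events", channel="bigquery", max_in_flight=64,
    )
    nsq.run()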

Tips:

- stick to StandardSQL, it's the more featureful SQL dialect

- flatten data where possible; arrays of sub-records are hard to work with

- partition large tables by date so you don't pay for querying really old data

- use the timestamp type instead of string/int fields; it saves you a cast when writing a query (the sketch below covers these last two)
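
A small sketch of those last two tips -- creating a date-partitioned table
with a proper TIMESTAMP column; project/dataset/table names are made up:

    # Hypothetical date-partitioned table with a real TIMESTAMP column.
    from google.cloud import bigquery

    client = bigquery.Client()
    table = bigquery.Table("example-project.example_dataset.events", schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),  # not string/int
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("event_name", "STRING"),
    ])
    table.time_partitioning = bigquery.TimePartitioning(field="event_ts")
    client.create_table(table)

    # Queries that filter on event_ts only scan the matching partitions,
    # so old data doesn't add to the bill.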

~~~
cpburns2009
Would you recommend Airflow? I've been researching pipeline management
libraries compatible with Python to replace our fragile cron/bash setup. From
what I've read Airflow appears to be the best contender in the Python space.

~~~
jackgolding
I used Airflow a few jobs ago, where I was a one-man data engineering team (a
single server doing ETL) -- honestly, I wish I had just used cron/bash.

~~~
mattbillenstein
When you have a few things to run, yeah -- and we started this way.

When you have a bunch of multi-step jobs with various dependencies and you
want to see their status visually -- Airflow is worth the headache.
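
To make that concrete, here's a toy DAG -- the task names and scripts are
made up -- showing the dependency graph the scheduler enforces and the web UI
draws:

    # Toy Airflow DAG: three made-up steps with enforced ordering.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("warehouse_hourly",
              start_date=datetime(2017, 10, 1),
              schedule_interval="@hourly")

    extract = BashOperator(task_id="extract", bash_command="python extract.py", dag=dag)
    load = BashOperator(task_id="load", bash_command="python load.py", dag=dag)
    report = BashOperator(task_id="report", bash_command="python report.py", dag=dag)

    extract >> load >> report  # the graph the web UI visualizes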

------
tixocloud
Just for discussion's sake, since you asked for stories about other
environments: our data is primarily on-premise and we use plain old Teradata for our
warehouse as well as write our own ETL scripts in SAS. We're looking into
Alteryx to see if we can merge our data together but I don't think any of the
above would be as useful to your environment. We are happy with our plain-old
relational database simply because our analysts can navigate it easily. I've
worked with other tools like Informatica, but it's been a pain to get them to
work with our data, which is why we wrote our own scripts.

~~~
tixocloud
I will add that the reason we're in this position is the legacy
databases/systems we have in place, the nature of our industry, and the volume
of our data, all of which make it trickier to migrate to new systems.

------
huy
It really depends on your budget and on your data's volume and nature, so each
person can only offer advice from their own experience.

I did this a while back at my previous employer/startup (it got acquired). We
used AWS, PostgreSQL, and Hadoop/Hive, with SQL plus custom Ruby scripts for
ETL and processing. You can read some of that writing here:
[https://engineering.viki.com/blog/2014/data-warehouse-and-an...](https://engineering.viki.com/blog/2014/data-warehouse-and-analytics-infrastructure-at-viki/)

If I had to distill it into bullet points, they would be:

- SQL is great, stick with it.

- We started with PostgreSQL and scaled a long way with it (using techniques
like table partitioning and unlogged tables; see the sketch after this list),
then slowly split into Redshift, and later Hadoop/Hive.

- Handling events/behavioural data (JSON/semi-structured, at large volume) is
very different from handling transactional data (structured, lower volume).

- Take a lean/incremental approach: get a basic DW setup, load data that you
can immediately act on, act on it (build reports, run analyses, show them to
management), then repeat.
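
As a hypothetical sketch of those two PostgreSQL techniques -- unlogged
tables and the inheritance-based partitioning that predates native Postgres
10 partitions; the table names and DSN are made up:

    # Hypothetical examples of both techniques, via psycopg2.
    import psycopg2

    conn = psycopg2.connect("dbname=warehouse")  # made-up DSN
    cur = conn.cursor()

    # Unlogged table: skips the write-ahead log, so bulk staging loads run
    # much faster -- at the cost of losing the rows on a crash.
    cur.execute("""
        CREATE UNLOGGED TABLE staging_events (
            event_ts timestamptz,
            payload  json
        )
    """)

    # Inheritance-based partitioning: one child table per month, with a
    # CHECK constraint so the planner can skip irrelevant partitions.
    cur.execute("""
        CREATE TABLE events (event_ts timestamptz, payload json);
        CREATE TABLE events_2014_01 (
            CHECK (event_ts >= '2014-01-01' AND event_ts < '2014-02-01')
        ) INHERITS (events);
    """)
    conn.commit()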

You can also check out my startup,
[https://www.holistics.io](https://www.holistics.io), where I turned those
experiences into a data platform that automates DW + BI.

We let you do "lean ETL" on top of your existing DW infrastructure (be it
Redshift, PostgreSQL, BigQuery, or others). We work with everyone from small
startups to unicorn tech companies.

~~~
mattbillenstein
Curious why you went to Hadoop/Hive from Redshift? I usually hear of people
going the other way.

I would recommend staying away from Hadoop -- if you're a Java shop writing
your own map/reduce jobs it may make some sense, but it seems particularly
brittle and hard to debug, although my info is a bit dated at this point.

And generally, if you must have Hadoop or Hive, the most elegant setup I've
seen is EMR backed by .json.gz on S3. You can spin up a cluster for a short
time to run some jobs, and then spin it down when you're done -- hundreds of
small instances seem to work well for this; but most of your interactive stuff
should probably live in another system.
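
A hedged sketch of that spin-up/spin-down pattern with boto3 -- instance
types, counts, and S3 paths are invented for illustration:

    # Hypothetical transient EMR cluster: run the steps, then terminate.
    import boto3

    emr = boto3.client("emr")
    resp = emr.run_job_flow(
        Name="nightly-etl",
        ReleaseLabel="emr-5.8.0",
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 100,  # lots of small instances
            "KeepJobFlowAliveWhenNoSteps": False,  # spin down when done
        },
        Steps=[{
            "Name": "hive-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", "s3://example-bucket/etl.q"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(resp["JobFlowId"])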

~~~
vlahmot
We do the EMR-backed-by-S3 setup, only with Snappy instead of gzip, since
gzipped files can't be split.

~~~
mattbillenstein
Ah, word, do you roll up the data by day? Or hour? I think in a situation
where you roll it up by hour and you have a lot of files, it can be spread out
pretty evenly on a large cluster.

------
scapecast
Three worthwhile reads on that topic:

- [https://blog.fishtownanalytics.com/what-are-the-steps-tools-...](https://blog.fishtownanalytics.com/what-are-the-steps-tools-in-setting-up-a-modern-saas-based-bi-infrastructure-281e0860f9a9)

- [http://blog.clearbit.com/enterprise-grade-analytics-for-star...](http://blog.clearbit.com/enterprise-grade-analytics-for-startups-2/)

- [https://www.simple.com/engineering/building-analytics-at-sim...](https://www.simple.com/engineering/building-analytics-at-simple)

Three choices for your data warehouse:

- Amazon Redshift

- Google BigQuery

- Snowflake

Redshift is by far the dominant product in this market; I'd say they have
around 60-70% market share. It also has the richest ecosystem of ETL and BI
tools. We saw Google BigQuery coming in aggressively earlier this year, but
now I'd say the major competitor that keeps coming up is Snowflake. I hear
people are happy with Snowflake and BigQuery. The major FUD you'll hear Google
and Snowflake spread about Redshift is that (1) it's expensive because compute
& storage are bundled, and (2) it doesn't scale. Both statements are false.

Then you also have a choice of fully managed offerings like:

- holistics.io

- panoply.io

I only learned about holistics.io in this thread. I have to say that I really
like their value proposition and features.

For your ETL, I have a whole spreadsheet full of tools that I'm happy to share
with you. You will have two major use cases: one to get your production data
into your DW, and another to unify it with all your SaaS data.

With intermix.io we've built a performance analytics service for Amazon
Redshift, so we see all the different ETL tools that companies use. The ones
we see used most frequently are:

- Fivetran

- Stitch Data

- Segment

- Alooma

All are very good products, and I can recommend any of them.

Hope that helps!

~~~
pvorb
Thank you for these insights. The article from Simple was especially helpful.

------
StephenAshmore
We do a lot of importing data from a large variety of data sources, so we use
MongoDB for persisting data and Elasticsearch for fast aggregations and
queries. We have quite a lot of data, but we haven't run into any problems
with Elasticsearch/MongoDB. A lot of the data we import has backing PostgreSQL
databases from partner companies, which adds some extra redundancy.
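
For illustration, a sketch of the kind of fast aggregation this buys, via the
elasticsearch-py client -- the index and field names here are hypothetical:

    # Hypothetical daily rollup across imported documents.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    result = es.search(index="imports", body={
        "size": 0,  # aggregations only, skip the hits themselves
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "imported_at", "interval": "day"},
            },
        },
    })
    for bucket in result["aggregations"]["per_day"]["buckets"]:
        print(bucket["key_as_string"], bucket["doc_count"])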

------
heliosAtwork
Not an answer, just wondering if anyone is using Apache Spark for ETL?

Is this approach viable, or is it just marketing?

[https://databricks.com/blog/2017/01/19/real-time-streaming-e...](https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html)
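
For reference, a minimal sketch of the pattern that post describes (Spark 2.x
structured streaming) -- the paths and schema below are invented, not taken
from the post:

    # Hypothetical streaming ETL: read JSON as it lands, clean it up,
    # continuously write Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_unixtime
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

    schema = StructType([
        StructField("user_id", StringType()),
        StructField("event", StringType()),
        StructField("ts", LongType()),
    ])

    raw = spark.readStream.schema(schema).json("s3://example-bucket/raw/")
    cleaned = raw.withColumn(
        "event_time", from_unixtime(col("ts")).cast("timestamp"))

    query = (cleaned.writeStream
             .format("parquet")
             .option("path", "s3://example-bucket/clean/")
             .option("checkpointLocation", "s3://example-bucket/checkpoints/")
             .start())
    query.awaitTermination()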

~~~
thisone
I use Spark for batch data transforms -- our data is structurally complicated
-- but not for streaming. I attended the EU Spark Summit last year, and as of
then Spark streaming wasn't prime-time ready; I don't know if anything has
changed on the streaming front in the last 12 months.

You should have a look at Flink if you're looking into streaming transforms.

------
usgroup
For ETL: Jenkins for orchestration, EMR + Pig + Hive for processing.

Warehousing: S3 with Parquet and ORC.

AWS Athena for analytical queries over S3.

Kinesis for stream buffering; also processed by EMR.

Everything except for Jenkins is pay-per-use. No idle components.

Very effective and cheap.
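
A rough sketch of the Athena piece via boto3 -- the database, table, and
bucket names are made up:

    # Hypothetical Athena query over Parquet/ORC files sitting in S3.
    import boto3

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString="""
            SELECT event_name, COUNT(*) AS n
            FROM warehouse.events
            WHERE dt = '2017-10-05'
            GROUP BY event_name
        """,
        QueryExecutionContext={"Database": "warehouse"},
        ResultConfiguration={
            "OutputLocation": "s3://example-bucket/athena-results/"},
    )
    print(resp["QueryExecutionId"])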

------
richardknop
At Sony PlayStation (EMEA region) they use Netezza and Ab Initio, plus some
other tools here and there. At least they used those a couple of years back;
they could have migrated to something else since then.

------
nowarninglabel
We use Snowflake. It's been great. We use Fivetran for loading into it. For
reporting, we use Looker.

We used to use Vertica + Tungsten. It was a brittle combo and also not well
supported by HP.

~~~
pvorb
What does HP mean in this context?

~~~
citruspi
Probably Hewlett Packard[0].

[0]
[https://en.wikipedia.org/wiki/Vertica](https://en.wikipedia.org/wiki/Vertica)

~~~
pvorb
Thanks. I didn't expect them in this context.

------
tmaly
I use quite a lot of Perl and Oracle for ETL. The database drivers have been
pretty solid, and the ease of regex in the language really helps.

