
Building Analytics at Simple - jklukas
https://www.simple.com/engineering/building-analytics-at-simple
======
jklukas
I wanted to mention some things that are still struggles with our current
platform at Simple.

Maintaining our Redshift schema can be time-consuming, especially since there
can be upstream changes that affect the schema of the messages we consume, and
we often aren't notified. We'll just be silently dropping the new columns, or
loads into Redshift will start to fail. Postgres 9.5 is supposed to include
DDL messages in logical decoding, which will help for our pipeline from
Postgres databases, but we'd love to develop a more robust system for
automatically updating the Redshift schema to accommodate changes in the
incoming data format. Naive solutions would likely be susceptible to unwanted
behavior coming from spurious messages. We could end up with a large number of
mostly empty columns in some cases.
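
To make that failure mode concrete, a naive version of the automatic schema
update might look roughly like the sketch below (Python against Redshift via
psycopg2; the table name and connection string are made up). It's exactly the
kind of approach that would turn one spurious message into a pile of junk
columns:

    # Naive sketch, not our production code: compare the fields of an incoming
    # JSON message against the columns Redshift already has and add any that
    # are missing. One malformed message would happily create junk columns.
    import json
    import psycopg2  # Redshift speaks the Postgres wire protocol

    def missing_columns(cur, table, message):
        cur.execute(
            "SELECT column_name FROM information_schema.columns "
            "WHERE table_name = %s",
            (table,),
        )
        existing = {row[0] for row in cur.fetchall()}
        return [field for field in message if field not in existing]

    def add_columns(cur, table, fields):
        for field in fields:
            # Everything lands as VARCHAR because a single message can't be
            # trusted to tell us the real type.
            cur.execute(
                'ALTER TABLE {} ADD COLUMN "{}" VARCHAR(65535)'.format(table, field)
            )

    conn = psycopg2.connect("dbname=analytics host=redshift.example.com port=5439")
    with conn, conn.cursor() as cur:
        message = json.loads('{"user_id": "123", "surprise_field": "hello"}')
        add_columns(cur, "events", missing_columns(cur, "events", message))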

We've also had trouble maintaining knowledge about the schema. We'd like to
have some sort of "data dictionary" that would be available for folks to
understand where the data in Redshift is coming from and what each of the
columns means. This might end up being a standalone document, but there's
concern about it staying up to date with changes in the schema. We've toyed
with the idea of maintaining this sort of documentation as Javadoc-style
comments directly in the repository where we define our Redshift schema and
migrations, but we haven't found any existing tooling to render documentation
from comments in SQL source files.
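
For example (purely a sketch of the idea, not tooling we actually have), a
small script could walk the schema repository, pull the leading block comment
out of each .sql file, and emit a single Markdown data dictionary:

    # Hypothetical sketch: collect the leading /* ... */ comment from each .sql
    # file in a schema repo and emit one Markdown "data dictionary" page.
    import pathlib
    import re

    HEADER_COMMENT = re.compile(r"\A\s*/\*(.*?)\*/", re.DOTALL)

    def build_data_dictionary(schema_dir, out_path):
        lines = ["# Redshift data dictionary", ""]
        for sql_file in sorted(pathlib.Path(schema_dir).rglob("*.sql")):
            match = HEADER_COMMENT.match(sql_file.read_text())
            doc = match.group(1).strip() if match else "_No documentation comment._"
            lines.extend(["## {}".format(sql_file.stem), "", doc, ""])
        pathlib.Path(out_path).write_text("\n".join(lines))

    build_data_dictionary("schema/", "data_dictionary.md")

Keeping a generated page like that in sync would still depend on people
actually writing and updating the comments, which is the part we worry about.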

------
jklukas
A question on Twitter
([https://twitter.com/sayhar/status/671867539480305664](https://twitter.com/sayhar/status/671867539480305664))
asked about how Simple decided to do ETL in Redshift.

I think the main benefit is flexibility. Since transformations are happening
in the same system where the data already lives, there's no I/O overhead and
the performance is pretty fast. Our full suite of nightly ETL jobs runs in
about 2 hours, and that's without any effort to run jobs in parallel. There's
also minimal schema maintenance involved; the nightly jobs completely drop the
existing tables, and in most cases the table creation is handled automatically
as part of a CREATE TABLE AS statement. That means we're pretty free to add,
drop, or rename columns without needing to worry about migrations or keeping
the schema in sync with what some external tool is doing.
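
To make that concrete, a nightly job in this style is little more than a
couple of SQL statements run on a schedule. A rough sketch (the report query
and connection details here are invented) looks like:

    # Sketch of a "drop and rebuild" nightly job. The report query is made up,
    # but the shape (DROP followed by CREATE TABLE AS) is the whole trick:
    # there are no migrations to manage because the table is recreated nightly.
    import psycopg2

    NIGHTLY_SQL = """
    DROP TABLE IF EXISTS report.daily_signups;
    CREATE TABLE report.daily_signups AS
    SELECT date_trunc('day', created_at) AS signup_day,
           count(*) AS signups
    FROM base.users
    GROUP BY 1;
    """

    conn = psycopg2.connect("dbname=analytics host=redshift.example.com port=5439")
    with conn, conn.cursor() as cur:
        cur.execute(NIGHTLY_SQL)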

That said, no one on our team has deep familiarity with ETL tools, so there
may be other good options that we simply haven't investigated.

~~~
joeharris76
I do have experience with ETL tools. The dirty secret of ETL tools is that
they are slow in the best case. In the worst case, they are extremely,
head-meltingly, how-could-it-possibly-take-that-long slow. Before MPP and
columnar DBs (Vertica, Greenplum, etc.) were widely available, ETL tools were
a good trade-off. You were offloading work that (in many cases) literally
could not be completed on either the source DB or the target data warehouse.

When we (my company at the time) got Netezza in 2004 we immediately realized
that we needed to move to ELT, i.e., the ETL tool (SQL Server DTS at the
time) became just a scheduler running plain SQL against Netezza. It was
massively faster to do it that way and we could put the ETL server budget into
Netezza.

ETL vendors, e.g., Informatica, now have "pushdown" options where you design
your workflow in their tool but it executes as SQL inside your database.
However, when you use a tool like this you're very constrained in the types of
processing you can do; for example, Informatica doesn't have the ability to
parse timezones (yes, really).

We evaluated basically _everything_ that claimed to support Redshift and
couldn't find anything that was viable for what we need. Some tools were
close, though, so maybe someone can close the gap.

------
joeharris76
We've struggled with the same issues. We also keep a "base" schema in Redshift
as a close mirror to source systems (tweaked a little for distribution keys).
In our case we control the extracts, but schema changes have to be manually
added. If a new column is important and we miss it, then we have to run
time-consuming backfills. The only way we've been able to ameliorate this
somewhat is to be more involved in the upstream release process, but that is
time-consuming in itself.

The solution I'd like to implement is having the extracts and loads be built
dynamically whenever a change is detected, e.g., a process that runs
separately, checks the schema of our sources for changes, dynamically updates
the ETL processes, and then publishes them to Azkaban. Looking forward,
though, our biggest data sources are going non-relational, so that approach
will entail a great deal more complexity.
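
The detection half of that is fairly mechanical for relational sources.
Something along these lines (a sketch with invented connection details) could
snapshot a source's information_schema each run and report what changed;
regenerating the loads and publishing to Azkaban would hang off the diff:

    # Sketch: snapshot information_schema.columns from a source database and
    # diff it against the previous run's snapshot to detect schema changes.
    import json
    import pathlib
    import psycopg2

    def snapshot_schema(dsn):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT table_name, column_name, data_type "
                "FROM information_schema.columns "
                "WHERE table_schema = 'public'"
            )
            return {"{}.{}".format(t, c): dtype for t, c, dtype in cur.fetchall()}

    def diff_since_last_run(dsn, state_file="source_schema.json"):
        current = snapshot_schema(dsn)
        path = pathlib.Path(state_file)
        previous = json.loads(path.read_text()) if path.exists() else {}
        path.write_text(json.dumps(current, indent=2))
        added = sorted(set(current) - set(previous))
        dropped = sorted(set(previous) - set(current))
        return added, dropped

    added, dropped = diff_since_last_run("dbname=source host=source-db.example.com")
    print("new columns:", added)
    print("dropped columns:", dropped)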

Regarding schema knowledge / metadata, this is a perennial problem. I once
worked on a huge Teradata DW system at a global bank that had a large internal
website for just this purpose, which was hugely out of date at all times.

When I started building out the analytic capability at my current company I
made a fork of "Rocco"
([https://github.com/joeharris76/rocco](https://github.com/joeharris76/rocco))
for doing inline documentation as Markdown inside SQL block comments.
[http://pow.cx/docs/](http://pow.cx/docs/) is a good example of what I was
hoping to achieve. However, even our smallest components have many, many files
and the docs were too difficult to navigate using the "Jump To…" dropdown
menu.

So, while this produces some very nice docs, sadly they turned out to be
largely unhelpful to anyone who didn't already understand the code. Our
current approach is to try to keep individual components small, include a
comprehensive README in each component, and use Slack comms to fill in the
gaps.

~~~
jklukas
Being able to adapt dynamically to schema changes would be great. If you do
develop a solution for that, please do publicize it. Simple uses segment.com
for some communication of analytics info to third parties, and Segment folks
have built some capabilities to infer schemas from incoming messages which
they then apply to history tables in Redshift. We're hoping to learn more from
Segment about how they make that happen.
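
We don't know how Segment actually implements it, but the core idea of
inferring a Redshift column type from a message's values is easy to sketch
(purely illustrative, not their code):

    # Toy illustration of inferring Redshift column types from one JSON event.
    # A real system would look at many events and widen types conservatively.
    import json

    def redshift_type(value):
        if isinstance(value, bool):  # check bool before int: True is an int in Python
            return "BOOLEAN"
        if isinstance(value, int):
            return "BIGINT"
        if isinstance(value, float):
            return "DOUBLE PRECISION"
        return "VARCHAR(65535)"

    event = json.loads('{"user_id": "u_123", "amount": 12.5, "is_trial": true}')
    print({field: redshift_type(value) for field, value in event.items()})
    # {'user_id': 'VARCHAR(65535)', 'amount': 'DOUBLE PRECISION', 'is_trial': 'BOOLEAN'}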

What you described with documentation efforts is the kind of scenario I fear. We
could spend significant effort putting together a documentation system that
simply doesn't meet a real need. It's good to know other folks are also
struggling with that, though.

~~~
joeharris76
We're very interested in trying AWS' new Database Migration Service to see if
it provides a nice, automated way to get a mirror of the source into Redshift.
[https://aws.amazon.com/dms/](https://aws.amazon.com/dms/)

