
Building a Data-Informed Culture - edawerd
http://engineering.gusto.com/building-a-data-informed-culture/
======
xenity7
A lot of this is pretty typical data warehouse stuff, but I really like the
concept of allowing "anyone" to creat ETL jobs rather than relying on a more
strict segregation of duties.

I've had the most success allowing free access to warehouses data with some
guidance given around how to use it. The data team doesn't need to be the
gatekeeper of all access to data - a better way is to consider them expert
advisors on how to use it.

Getting bottlenecked with simple requests because you refuse to give technical
end users full read access is silly.

~~~
v64
At a previous gig with an open data warehouse, we had problems with teams
pulling data incorrectly (e.g. counting visits without using the correct
combination of filters to get the "official" count) and using them in company
documents and press releases.

The solution there was to cut off access and force such requests to go through
the data team. What's the solution if you still want them to have access to
the raw data? No matter how much documentation we produced, it never stopped
someone from screwing up a join and getting the wrong information as a result.

~~~
dasternberg
We actually control access to our raw data sources. The vast majority of
people looking at data look at data we surface in our BI layer in Looker (or
in Aleph if they know SQL).

I didn't focus on this in the article, but in addition to removing sensitive
information, that layer simplifies a lot of the complexities of our app's data
models. Core calculations like the one in your parenthetical example live in
core dashboards that the Data Analytics team owns.

Raw access is limited to our applications like Airflow and a small number of
authenticated users. So an engineer can easily go into our Airflow repo and
add some new columns to a table in the BI layer by updating the SELECT clause
of the query that creates that table, without needing to access the raw data
themselves.

~~~
v64
When I say raw data, I mean the data that's already been processed from
original sources and stored in the data warehouse (that is, the warehouse data
as it existed in our database and not how it was presented through third party
tools like Tableau). I agree that for high level stuff like visits, etc,
dashboards help out a lot.

For some more specific examples where open access broke down:

1) We had a fact table with vendor IDs and service start/end dates. If the
service end date was null, that meant service for that vendor was active. To
correctly pull the number of active vendors, you would want to do something
like "select count(distinct vendor_id) from f_vendor_service where
service_end_date is null". Users often queried this table without distinct in
the count and without the where clause, resulting in inflated numbers. We did
end up adding "Number of Active Vendors" to a dashboard, but then people would
return to the warehouse table to get the number of active vendors in 2015,
etc. and their clause would be wrong.

2) Our vendors could be classified into one or more categories. Among the
vendor's categories, one category was selected as the primary category. There
were cases where people wanted to pull counts on how many vendors were in X
category, but they would always end up querying for how many vendors had X as
the primary category instead of how many vendors had X as any category. We had
to provide a dashboard to get these numbers reported correctly.

3) We had a pageview_log table, and people would want to examine URLs to see
what the least visited pages were. However, their regular expression ability
was lacking, and despite our documentation, people would still use bad regexes
and pull bad counts that would be used to justify product decisions for
shutting down sections of the site. Again, we put this behind a dashboard so
people wouldn't query pageview_log manually.

So, from my point of view, curating information behind dashboards isn't giving
people access to the raw data. The original desire was for teams to create
their own dashboards, but due to their bad queries, the data team ended up
having to spend a good chunk of their time maintaining all the dashboards in
the company. When you have people writing raw queries in Aleph, how do you
prevent them from making the same types of mistakes that I illustrated?

~~~
dasternberg
In my experience you have to strike a balance between allowing exploration
that unblocks people outside the team to look into data (with a bit of a
buyer-beware attitude) to help them build intuitions while making sure that
core metrics that you use internally and/or share publicly are appropriately
vetted. Using Aleph or other shared SQL tools like it allows us to quickly
look at a query someone else on another team has built and provide support.
Making time to walk through core tables and their pitfalls as part of SQL
training can also help.

Reporting on core company metrics needs to be curated and owned by people with
the right expertise. In some cases we'll build additional views on top of our
core BI layer that simplify a lot of this information for some of the examples
you're describing (like the number of active customers at different points in
time) and use to drive our dashboards, and people can query from those. Aleph
also supports tagging and ad hoc parameter setting, so you could fix and build
off of more novice users' queries, tag them as "official" and allow them to
set parameters like time ranges.

But at the end of the day, AFAIK there's no silver bullet here - you have to
train people in the pitfalls of your data and expect that they'll make some
mistakes but it's part of the data team's responsibility to help them learn.
As individuals in different groups gain expertise, they can help each other
too. The alternative extreme, where a priesthood owns all data analysis will
avoid mistakes, but it won't allow your team to work on activities that are
both high leverage for them and provide more value to the company.

~~~
v64
It sounds like we could've used something like Aleph to centralize the queries
that were coming in, as opposed to the case where somebody copy and pastes a
query to someone who adds a column and then emails it to someone else, etc
etc. And you're right, there is no silver bullet, so it's always good to hear
how other folks are making it work. Thanks for the response!

~~~
dasternberg
Aleph is an open source Rails app and some great folks where I used to work
built it - worth a look:
[https://github.com/lumoslabs/aleph](https://github.com/lumoslabs/aleph)

