
Ask HN: What does your BI stack look like? - Dwolb
How does your company visualize and report on what’s happening in the business?
======
seveibar
Postgres -> Metabase

I believe this is the best combination of cheap/powerful for early-stage
startups. My very non-technical cofounder is able to use Metabase's simple GUI
interface to create graphs/insights (even joining and aggregating across
tables!), and for anything complex I can step in and give a helper SQL query.
We aggregate across around 10M records for daily insights.

Metabase can also be run as a pseudo-desktop/web application to save
additional cash (we don't do this though).

~~~
mazameli
UX guy from Metabase here. Glad to hear it's working well for you. Happy to
answer any questions folks might have.

~~~
Dwolb
Wow great to see you in the thread. No questions from me but great job on the
latest update.

~~~
salsakran
Thanks!

-Another Metabase team member

------
dwl285
If you have a data team: Stitch / Segment -> BigQuery -> Dataform -> BigQuery
-> Looker

I work with many companies helping them set up their data stack, and from what
I've seen this is pretty much the optimal setup. These tools all require very
little maintenance, are relatively cheap (compared to the manpower required
to set these things up from scratch internally), and scale well as companies
grow and complexity increases.

If you don't have a data team: Segment -> Amplitude/Mixpanel

If you don't have a dedicated data team, you can do a lot with tools like
Amplitude and Mixpanel. Get your engineering team set up with Segment and
tracking all of the important interactions within your product, and set
Segment to pass those events to Amp/Mix. Self-serve analytics in these tools
is then pretty easy and will serve you well until you start hiring a data
team.
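
For concreteness, server-side event tracking with Segment's analytics-python
library looks roughly like this (the write key, user ID, and event names here
are made up):

    import analytics

    # Write key for the Segment source you configured (placeholder).
    analytics.write_key = "YOUR_WRITE_KEY"

    # Tie events to a known user, then record a product interaction.
    analytics.identify("user-123", {"email": "jane@example.com", "plan": "pro"})
    analytics.track("user-123", "Report Exported",
                    {"format": "csv", "rows": 1042})

    # Flush the background queue before the process exits.
    analytics.flush()

Segment then fans those events out to Amplitude/Mixpanel based on your
destination settings.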

Full disclosure: I work for Dataform (used to be a customer, loved it so much
I decided to join them)

~~~
alexfromapex
Thanks for sharing, I don’t think there’s an “optimal setup” in the objective
sense but good to know this works well.

~~~
dwl285
Good point. I guess I mean: if you put a gun to my head and forced me to pick
one, this is what I'd pick (and I think it works for a pretty meaningful
percentage of companies that are looking for an answer to the original
question).

------
rajivayyangar
As a data scientist (startup / Yahoo) turned product manager (4 startups),
I've used a variety of stacks in the past - from plain SQL, to Mode, to
Mixpanel, Looker, Interana, and Hive.

Recently we started using PopSQL ([https://popsql.com](https://popsql.com))
and love it.

When I don't have a dedicated data team, my philosophy is:

1) Make it difficult to get wrong answers

- Don't use Google Analytics. It's too easy to generate incorrect charts, and
too difficult to verify them.

- Have a limited sandbox of reports for non-SQL writers

- Keep the SQL close to the report, so it's easy to verify the underlying
query.

- Push people to learn even basic SQL

2) Make it quick and easy to ask iterative questions - PopSQL is way faster
than Mode. Like 20x faster.

3) For metrics that matter (e.g. KPIs), instrument them directly and even
build a custom analytics dashboard if it's important. (Beware dashboard
clutter!
[https://twitter.com/andrewchen/status/1193619877489192961](https://twitter.com/andrewchen/status/1193619877489192961)
)

~~~
netcraft
$20/user/month seems incredibly expensive for an editor that we would have
traditionally paid a one time cost for. I assume this has the saving and
sharing of queries built in, but that is a lot of money for some storage. Are
there other major benefits I am not seeing? (It is not my intention to attack
you or make you defend this product, I am just curious if I am missing
something - there are a lot of similar products with similarly questionable
value propositions)

~~~
rajivayyangar
Saving, shared queries, easy link sharing, and basic charts with some
intelligence. It also has excellent query UX (auto-complete, etc.).

Most of all, it's really fast. It fundamentally depends on your DB of course,
but PopSQL doesn't add any extra bloat the way Mode does.

------
numlocked
At Grove, we are:

Airflow -> S3 -> DBT with Spark/EMR or Redshift/Spectrum -> Redshift data
marts -> Looker

At least, that’s the way we like our pipelines to work. In practice we have a
couple of extractions that land directly in Redshift (we extract Zendesk data,
for instance, with Stitch Data). We use Snowplow for click stream analytics.
And we’ll likely move from Redshift to Snowflake (or mayybbbeee Bigtable) in
Q2 of 2020.

We used to do all of our transforms via DBT within Redshift but have been
offloading the heavier-duty pieces (like Snowplow event processing) to Spark
jobs because they were too taxing on Redshift.
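
To give a flavor of that offloading, a PySpark job over raw Snowplow events
might be shaped like this (a hedged sketch, not our actual job; the bucket
paths are invented, though collector_tstamp and domain_userid are standard
Snowplow fields):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("snowplow_rollup").getOrCreate()

    # Read a day of enriched Snowplow events straight off S3.
    events = spark.read.json("s3://example-bucket/snowplow/raw/2020-01-01/")

    # Aggregate outside Redshift, where the heavy lifting is cheap.
    daily = (
        events
        .withColumn("event_date", F.to_date("collector_tstamp"))
        .groupBy("event_date", "event_name")
        .agg(F.countDistinct("domain_userid").alias("unique_users"),
             F.count("*").alias("event_count"))
    )

    # Land Parquet back on S3 for Spectrum / downstream data marts.
    daily.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://example-bucket/snowplow/daily_rollups/")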

We’ve gone through 3 “generations” of Looker reporting — gen 1 was just
wrapping LookML around our schema and forcing Looker to do the joins and
generate SQL for everything we wanted to know. Second generation involved a
bunch of common table expressions within Looker itself that worked, but were
developed without much thought as to data mart design. Gen 3 is where we are
now with super deliberate scoping and implementation of warehouses in DBT.
Before any of that we rolled our own tool [0].

Happy to answer any questions. We’ve gone from ~30 to ~1000 employees in the 3
years since we started using Looker and have learned a thing or two along the
way.

[0] [https://github.com/groveco/django-sql-explorer](https://github.com/groveco/django-sql-explorer)

~~~
pcarolan
Very similar to our progression. Do you have a good way to manage your data
dictionary and catalog?

~~~
stevepike
I don't want to derail the thread, but I recently started a company
([https://syndetic.co](https://syndetic.co)) that's working on this problem.
We've been focusing on the external data dictionary use case (how does a data-
as-a-service company explain a dataset to their prospective customers) but
we've been encountering a number of companies that are evaluating data
catalogs and other internal tools for their data science teams.

I would really appreciate getting your perspective - I'm steve (at)
syndetic.co

~~~
pcarolan
Was just thinking this would make a great business opportunity. Thanks, we'll
check it out!

~~~
dataminded
This space desperately needs some new options.

The legacy players are a mess, the new cloud-native offerings are overly
engineering-focused and immature, and the few startups in this space are
recreating the legacy solutions in a SaaS format.

Go forth and claim the significant checks that I and others plan to write for
these mediocre offerings.

------
vitorbaptistaa
Luigi, AWS S3, DBT, Snowflake and Re:dash (currently evaluating Metabase and
Looker to allow queries without SQL)

Luigi runs our scrapers and other workflow management tasks (e.g. DB backups).

All raw data lives in S3. We make an effort to be able to recreate the whole
data warehouse from the raw data, so if any cleaning/normalization process
fails, we have this safety net. I'm curious to hear if others use a similar
pattern, or if there are better options.

DBT handles both loading the data from S3 into Snowflake (by creating the
Snowflake Stages) and transforming the data in Snowflake. This isn't how DBT
is usually used, but it felt wasteful to add Stitch or another tool to load
the data into Snowflake, as Snowflake supports it out of the box. I also
created a `setup_snowflake` macro operation that creates our users,
warehouses, databases, etc., in Snowflake (a kind of "poor man's Terraform").
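
To make that concrete, here's a rough standalone equivalent of the stage
creation and "poor man's Terraform" setup using snowflake-connector-python
(our real version lives in a dbt macro; every name and the bucket below are
placeholders):

    import snowflake.connector

    # Idempotent DDL: safe to re-run, like a tiny Terraform.
    ddl = [
        "CREATE WAREHOUSE IF NOT EXISTS transforming "
        "WITH WAREHOUSE_SIZE = 'XSMALL'",
        "CREATE DATABASE IF NOT EXISTS analytics",
        # An external stage pointing at the raw data in S3.
        "CREATE STAGE IF NOT EXISTS analytics.public.raw_events "
        "URL = 's3://example-raw-bucket/events/' "
        "FILE_FORMAT = (TYPE = 'JSON')",
        # Load straight from the stage; no Stitch needed.
        "COPY INTO analytics.public.events FROM @analytics.public.raw_events",
    ]

    conn = snowflake.connector.connect(
        account="xy12345", user="loader", password="...", role="SYSADMIN")
    try:
        for stmt in ddl:
            conn.cursor().execute(stmt)
    finally:
        conn.close()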

I don't think Snowflake needs an introduction. It's an amazing tool. We used
Postgres before, but Snowflake is much, much better, even though our DB is
pretty small (~200 GB).

Finally, we use Re:dash as a BI, but I'm checking other options that allow
usage without SQL (currently Metabase and Looker).

~~~
huy
You might want to check out Holistics.io, which allows your non-technical
users to explore data without writing SQL.

It's an alternative to Looker but more affordable, and more powerful than
Metabase.

~~~
beckingz
If you set up your data well and have denormalized data tables, Metabase is
really good.

------
pcarolan
Looker on top of Redshift. Events streamed in from Segment and ELT transforms
managed by Airflow. Looker gives you nice visualizations and charting with
RBAC and some lightweight ETL functionality. The real advantage of Looker
is their modeling layer, which sits on top of your physical data and is done
via a DSL called LookML. Source control is managed via a wrapper around git.
The end result is that analysts can move lightning fast building their own
models in SQL or do analysis via drag-and-drop explorations. Looker's customer
support is the best I've experienced and hasn't changed since Google acquired
Looker. We're likely moving off Redshift to Snowflake in the next 6 months
because Redshift is slow to pick up changes and we want to decouple storage
and compute for scaling reasons. Airflow is an incredible platform but needs
customization and templated workflows to make it great. Data build tool (DBT)
is how we plan on managing ETL code via Airflow in the near future. We're also
adding Spark, but more for data science.

~~~
dwl285
If you're interested in an alternative to dbt that's a little more analyst
friendly, check out Dataform. A lot of teams are using Dataform between the
warehouse and looker, to handle transformation logic using pure SQL, but in a
more user friendly format than Airflow. Get in touch if you'd like to chat!
dan at dataform.co

------
thecodemonkey
We're a small, bootstrapped company with 2 people. Some data sets are millions
of rows, others are billions.

The goal for us is KISS: keeping everything as simple as possible in terms of
infrastructure, ease of use, and cost.

Primarily we're using Metabase in front of multiple MariaDB instances.
Metabase is absolutely fantastic!

We also have a couple of additional small projects:

- A small Node.js app that receives events live via fluentbit, saves them off
in MariaDB and sends text message notifications via Twilio when certain
thresholds are exceeded

- A small "ETL" app that polls the Stripe and QuickBooks APIs to mirror data
in a MariaDB database so we can easily access it from Metabase
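
The Stripe half of that mirror boils down to something like this (a simplified
sketch using the stripe and pymysql libraries; the table layout and
credentials are placeholders, and the real app also handles incremental
cursors):

    import stripe
    import pymysql

    stripe.api_key = "sk_test_..."  # placeholder key

    db = pymysql.connect(host="localhost", user="etl",
                         password="...", database="finance")

    # Page through every charge and upsert it into the mirror table
    # (assumes id is the primary key).
    charges = stripe.Charge.list(limit=100)
    with db.cursor() as cur:
        for charge in charges.auto_paging_iter():
            cur.execute(
                "REPLACE INTO stripe_charges "
                "(id, amount, currency, created, status) "
                "VALUES (%s, %s, %s, FROM_UNIXTIME(%s), %s)",
                (charge.id, charge.amount, charge.currency,
                 charge.created, charge.status),
            )
    db.commit()
    db.close()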

------
veritas3241
Stitch/Fivetran/Custom -> Snowflake -> dbt -> Periscope

Lots more documentation here: [https://about.gitlab.com/handbook/business-ops/data-team/](https://about.gitlab.com/handbook/business-ops/data-team/)

We have many of our KPIs embedded in the handbook (look for KPI index mapping
link).

This is also our primary project where you can see all of our dbt code
[https://gitlab.com/gitlab-data/analytics/](https://gitlab.com/gitlab-data/analytics/)

~~~
iblaine
The GitLab wiki has some amazing content on it. Thanks for making it public.

------
crustacean
Sorry to be that one avoids-the-question comment, but:

Without design sessions to figure out your data store design (look up Kimball,
Inmon), and then monitoring/testing to make sure everything is running
smoothly, any data stack will be hell.

Badly designed data brings fatigue and rancor and unnecessary work and is a
huge hard-to-quantify money suck.

~~~
dwl285
IMO the right solution to this is not spending a bunch of time upfront trying
to design your warehouse. You just want to adopt a tool that makes it easy to
continuously refactor into more optimal formats as your warehouse scales.
Happy to discuss more if you're interested! dan at dataform.co

------
grahamdietz
We use Snowplow directly into Snowflake and report on this using Mode. We are
a data native SaaS firm, and we set things up like this about 5 years ago and
it has served us well. Streamlined and high performance. For all other
sources, we use Stitch -> Snowflake, with one semi-custom Singer.io tap also
running on Stitch. All this was simple to set up and means we don't have to
worry about getting the data into one place. Of course, we then have lots of
work to do in Snowflake and Mode to get the data the business needs. We share
the reports from Mode to various teams via Slack. Hope this helps.

~~~
dwl285
Are you using Dataform Graham? I don't think we've spoken if you are, I'd love
to chat! dan at dataform.co

------
aaronharnly
At Amplify, Matillion/Stitch/Fivetran/Custom -> S3 + Snowflake -> Matillion
for transforms -> Looker + Tableau.

It's important to say out loud that a lot of analysis also happens within
product-local reporting systems, or in "offline" Google sheets.

We are currently working on building out the same stack, terminating in a
Powered By Looker instance, for customer-facing aggregate reporting.

The engineering and data science teams do great with Looker, but Tableau still
covers use cases for non-engineer business people (think finance, customer
operations, training, etc.) who want to start from source data plus a
few sidebar Google sheets, tinker around, and then develop a rough report
without waiting for iterative cycles with the data engineering team. We're
thinking hard about how to get those use cases into the warehouse + Looker
faster.

------
nm2259
Data infrastructure is scattered and siloed: Excel sheets and Google sheets
stored in various places (personal G drives, company G drives, some network
share somewhere, an MS SharePoint site, sometimes in development git repos or
various wikis)

Reporting infrastructure is manual massaging and lots of powerpoint.

My company isn't that large, but bigger than you'd think for such a "system".

~~~
ska
> but bigger than you'd think for such a "system".

I suspect you underestimate how large some of the organizations are that use a
variant of this system.

------
thunderbong
After reading the comments here, I decided to give Metabase[0] a shot. Awesome
analysis tool. Very impressed. Works even with large SQLite databases.

[0]:[https://www.metabase.com/](https://www.metabase.com/)

~~~
elwell
Being a Clojure [0] team, we naturally use Metabase as well. We analyze a
Postgres data warehouse fed from Segment, as well as our production DB.

[0] -
[https://github.com/metabase/metabase](https://github.com/metabase/metabase)

------
pantene
Data warehouse:

- Amazon Redshift (data sourced from many different backend DBs; e.g.
PostgreSQL, MySQL, other Redshift instances etc.)

- BigQuery (Some teams store the data for their projects here. For reporting,
they're selectively transferred to Redshift.)

Reports:

- Tableau (extracts of pre-aggregated data from Redshift)

- Looker (connects live to and executes its own queries on Redshift)

Anything that is based on pre-aggregated (rather small, e.g. n_rows < 5mil)
data is visualized on Tableau. If users want to work on a self-service tool
they use Looker which can connect to (cleaned) raw data optimized for this use
case.

ETLs for raw data run daily on Redshift. Reports are also scheduled on
Redshift (with Airflow) but the refresh interval is report-specific.

------
lifeslogit
We keep it very simple as well.

Postgres read-replica for cheap realtime stuff, Redshift for the larger sets,
Airflow to manage transfers, and Metabase to visualize and share. We also
collect data from a bunch of APIs, but those each run via their own job and
store in Postgres.

We also try to define wide short tables in Redshift that enable users to use
Metabase to ask questions.

I've been very happy with Metabase. Given that we can't afford Looker right
now (but we would all love to), it is pretty solid.

------
mickeyben
Stitch/Airflow/Other -> Snowflake -> dbt -> Snowflake

Everything goes through S3 because Snowflake's storage sits on it.

dbt is amazing. We began using it a month ago and it has already transformed
the way our data team works. It really is a value multiplier for everyone.
Data engineers are happier because they don't need to write and maintain data
transformations, analysts are happier because they can maintain their own SQL
pipelines, and the whole company is happier because we now have a great
documentation tool to explore our data.

We're also big fans of Snowflake; it makes operating a data warehouse a breeze.

Then, we use a mix of Redash & Tableau for reporting.

Redash for static reporting (open to the whole company) & Tableau to create
more complex data tools we expose to some internal teams: Marketing, Risk,
Finance...

------
asati
Postgres (read replica) -> Redash

Segment -> Amplitude (but using it less and less)

I am surprised no one mentioned [https://redash.io/](https://redash.io/) till
now (a lot cheaper than looker/mode/Periscope with all the basic functionality
that you might need).

~~~
LunaSea
Redash is far too expensive for what it does imo. Haven't used the other
solutions though.

------
buremba
We use Segment for event tracking, PostgreSQL for transactional data, and a
number of spreadsheets and third-party integrations via Stitch Data. Since our
data is relatively small, we use PG as a data warehouse and heavily use DBT
for ETL. The people who are familiar with SQL just use DataGrip; for the UI we
use our tool [https://rakam.io](https://rakam.io).

Shameless plug: it's basically similar to Looker, but it plays well with
product data and integrates with Segment as well.

------
sejtnjir
Informatica -> ADLS -> SQL Server -> PowerBI

Also, in the same department: NiFi -> HDFS -> Spark -> Hive -> NiFi -> ADLS ->
PowerBI

And: NiFi -> Azure Event Hubs -> Azure container instances -> Event Hubs ->
Streaming Analytics -> PowerBI

I'm pretty fond of the last stack for streaming dashboards in the sensor data
realm.

------
chrisjc

        Kafka-connect ->
        Alooma        -> Snowflake -> SQL/sf-tasks -> Snowflake -> Looker
        custom        ->
    

Using Kafka-connect, we're able to serve up near real-time (2-5 mins) insights
on device generated events.

We probably need to use some kind of ETL tool to replace custom SQL and sf-
tasks. Unfortunately, we haven't been able to find a tool that handles this in
a non-batch (even if it's micro-batching) form. Snowflake change-streams and
tasks allow us to ETL in a streaming-like fashion.
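
For anyone curious, the stream + task pattern boils down to something like
this (names and the SELECT are illustrative, not our actual transform; run
through the standard Python connector or any Snowflake client):

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="xy12345", user="etl", password="...")
    cur = conn.cursor()

    # A stream records which rows of raw_events changed since last read.
    cur.execute(
        "CREATE STREAM IF NOT EXISTS raw_events_stream ON TABLE raw_events")

    # A task wakes on a schedule but only runs when the stream has data,
    # which is what gives the streaming-like micro-batch ETL.
    cur.execute("""
        CREATE TASK IF NOT EXISTS transform_events
          WAREHOUSE = transforming
          SCHEDULE = '2 MINUTE'
          WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
        AS
          INSERT INTO events_clean (event_id, payload)
          SELECT event_id, payload FROM raw_events_stream
    """)
    cur.execute("ALTER TASK transform_events RESUME")
    conn.close()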

We're ingesting everything from raw/transformed/aggregated events, micro-
service DBs (as fast as they sprout up), netsuite/salesforce, mixpanel, MySQL,
MongoDB... Billions of rows of data across multiple data sources, accessible
to internal and external customers in a matter of seconds. It's been an
incredible
challenge, especially with only a team of 2-5 people.

------
kfk
Python, Redshift, Tableau. But if you are starting from scratch I'd suggest
focusing on the ETL piece with Python and sending PDF reports generated with
LaTeX. Too many people get distracted by the fancy reporting stuff and don't
do the 2 things that matter: good ETL with a good SQL DB, and analytics that
is tied to results the company cares about.

------
iblaine
It's interesting seeing the various stacks being used... Here at One Medical:

[ onemedical.com, mixpanel, Google Sheets, Salesforce, etc ] -> S3 (Amazon
EMR) -> [ Tableau, Aurora MySQL ]

It's a nice & clean stack for data engineering.

Airflow is used for orchestration and is heavily customized with plugins.
Zeppelin notebooks are used by most analysts.

We'll probably be replacing Aurora MySQL w/an MPP database like Snowflake or
Redshift. MySQL is a crutch, but easy to deploy and manage to a point.

Several python frameworks also do various ETL & data science functions, but
everything generally revolves around S3 & Airflow.

Amazon EMR is a great swiss army knife for ETL. Moving between Pig, Spark &
Hive is trivial. Tableau is a bit of a challenge. Tableau seems to give users
too much rope to hang themselves with.

Also, we're hiring:
[https://www.onemedical.com/careers/](https://www.onemedical.com/careers/)

------
soumyadeb
In my previous company, we did TreasureData->Tableau.

TreasureData is a platform like Segment and lets you bring your customer event
data as well as data from SaaS tools (like Salesforce, Zendesk) into a data
warehouse hosted by TreasureData. It worked great but had the downside that
all the data was in TreasureData and we were kind of locked into it. Segment
kind of solves that problem because it has connectors to Redshift/Snowflake
etc so you can keep ownership of your data warehouse but the warehouse sync
delay (in our version) was a problem.

Also, BI was just one of the use cases. We wanted to send the data to 3rd
party tools (like Facebook ads) based on some logic (some were simple rules
but we had complex ML driven lead scoring models too). TreasureData was more
flexible on being able to run those rules and activate the results but ideally
we wanted to run them on top of our own warehouse in AWS.

------
thenaturalist
Sharing an awesome business intelligence tools list [0] I created ca. 2 years
ago to get a better understanding of what's out there.

[0]: [https://github.com/thenaturalist/awesome-business-intelligence](https://github.com/thenaturalist/awesome-business-intelligence)

------
gavinray
It covers many more bases than business intelligence, but Forest Admin. And
most of its functionality is free.

[https://www.forestadmin.com/](https://www.forestadmin.com/)

It generates a very beautiful CRUD admin dashboard automatically via
reflection.

Allows building drag-and-drop data viz dashboards, saving commonly-used custom
queries as "scopes", and even building your own HTML views if you need to get
really fancy (think tracking live deliveries on a map, etc).

Also has Stripe and Intercom integrations.

I really can't hype this enough. Have been using this on nearly every app I've
built the past three years.

The core team also answered my emails as a never-paying customer within 2-3
days the few times I have mailed them over the years I've used it.

~~~
dinkleberg
Wow, that looks incredible. I don't know why, but I've never really thought to
look for admin site solutions out there, instead opting for janky in-house
built ones. I'll definitely be playing around with this.

------
mjirv
Stitch -> Redshift (with DBT for data modeling) -> Looker

For a smaller company, it makes a lot of sense for us to use off-the-shelf
tools like these rather than rolling our own pipelines. Every once in a while
we run into issues where Stitch isn't quite as flexible as we'd like (can't
set dist/sortkeys etc), but that's the tradeoff you make for its simplicity.

DBT is amazing and I can't recommend it highly enough!

Looker works for analytics, but we're starting to do more data-sciency work,
and it doesn't have built-in Jupyter notebooks or anything like that. Does
anyone have a solution they use and like for hosting and sharing notebooks
alongside a stack like this?

------
mmckelvy
A lot of these answers seem to focus on app analytics (e.g. collecting clicks,
page views, etc. from Segment). How are people collecting / integrating
financial data (e.g. sales, subscriptions, expenses)?

~~~
chrisjc
So how are you collecting/integrating financial data such as what is contained
in Salesforce/NetSuite/etc.?

My recommendation would be

- an ETL as a service such as Fivetran, Stitch, etc.

- into a data warehouse such as Snowflake, BigQuery, etc.

- transform/aggregate/normalize/magic the data using some ETL tool such as
dbt, Matillion, etc.

- an analytics tool such as Looker, Tableau, etc...

------
mharroun
I've done 2 data pipelines, one a lot like what most people here are
describing.

The other I had to build for a startup with millions of monthly uniques but
only seed funding (can't do a 30+k a month data eng bill).

Went with custom event emission -> Kinesis -> (Druid & S3) and used Imply
([https://imply.io/](https://imply.io/)). Easy real-time data analytics, auto-
enrichment with Druid lookups from an RDBMS, and a simple UI for slice/dice
investigation of metrics. All in all it cost less than the cheapest Looker
license.

------
cozuya
We made our own reporting. Seems crazy to pay multiple 3rd parties to look at
our own data. We're using Apache Druid on the back end. (Giant Fortune 50
non-tech company)

~~~
korse
SQL server -> R server running on CentOS

Pipe stuff to Domo for the investors.

------
topogios
Daily ETL in the form of SQL scripts from prod DBs to an internal data
warehouse for use via Saiku, case-by-case Shiny apps and some bash
email reporting.

------
wilbo
Postgresql > pgadmin

For less technical people > metabase

For automated reporting and storing historic trends > Klipfolio

For near real time automated operational reporting > kloud.io

------
css
Formerly: Oracle Hyperion -> ETL to Azure -> PowerBI

Currently: Internal Data Warehouse -> RDS -> Internal web app (Django, React)

~~~
arethuza
Was that Essbase or HFM? I had lots of fun once getting bulk data out of HFM
and into spreadsheets via a web app.

~~~
css
Essbase, which made it a pain to get everyone off of Excel.

------
zshrdlu
Custom ETL -> BiqQuery -> Datastudio & Metabase

We initially considered Stitch and other as-a-service ETL, but ~500 lines of
Python later we had our own thing. I also experimented with FDW:
[https://unwindprotect.com/postgres-fdw](https://unwindprotect.com/postgres-fdw)
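
For what it's worth, the core of a DIY BigQuery loader is tiny; a hedged
sketch with google-cloud-bigquery (project, bucket and table names invented):

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")

    # Append newline-delimited JSON from GCS, letting BQ infer the schema.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_uri(
        "gs://example-bucket/exports/orders-*.json",
        "example-project.warehouse.orders",
        job_config=job_config,
    )
    load_job.result()  # block until the load finishes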

------
mister_hn
Does Excel qualify here?

------
elwell
On a meta note, if you're interested in viewing & sharing stacks, that's the
primary feature of the startup I'm working on: Vetd (app.vetd.com). The
communities we host (often VC portfolio companies) share their stacks and
leverage them for discounts.

- CTO (chris at vetd.com)

~~~
chrisjc
What does this have to do with BI stacks?

~~~
elwell
Companies using our platform add all sorts of items to their stack, including
BI tools/services.

~~~
chrisjc
Nice. So hypothetically you could show a stack similar to

    
    
        Fivetran -> Snowflake -> Looker

~~~
elwell
We don't currently have any way to define relationships between products, but
I like that idea...

It looks something like this [0] when you view the aggregated data of the
companies in your community, and you can also view each company's individual
stack.

[0] - [https://imgur.com/a/UXLN5KJ](https://imgur.com/a/UXLN5KJ)

------
mattbillenstein
We use BigQuery as the data warehouse - there are airflow jobs to periodically
load data into BQ - most of these are simple python scripts.

Metabase for most of our simple BI metrics - Tableau for some advanced users
doing more complicated stuff.

------
jconley
App, Web, IoT device send realtime events to a fluentd system with many events
going to Segment. Web front end also loads Segment for various ad trackers.
IoT device also uploads telemetry and other logs in both realtime and
asynchronously and those end up in some Postgres databases.

Segment syncs our event data periodically to our data warehouse (Redshift).

We have a readonly replica of the eCommerce DB for live stats (Postgres).

And there is a time series db for system/IoT telemetry (InfluxDB).

Most of our BI queries are done in Mode. Some are done in Grafana (data in our
InfluxDB and some live data). Spot check against Google Analytics or FB ad
tracker...

------
beckingz
MariaDB > Metabase

Stitch/Segment > BigQuery > Metabase

Stitch/Segment > BigQuery > Google Data Studio (curmudgeonly stakeholders
refuse to log in to BI tools... but require reporting anyways...)

------
exabrial
MySQL -> replication -> MySQL -> Metabase & Tableau

We want to switch to Postgres because of features, but "the devil you know is
better than the devil you don't", so we're just kinda sticking with MySQL.

------
kmerrol
A bit different from the crowd here. After much searching, our BI stack has
been fully converted to data streaming using: StreamSets DataCollector >>
Kafka >> AWS S3 >> AWS Redshift >> Dremio >> Jupyter Notebooks. Great to have
Jupyter take on data prep and data analysis tasks while external tables in
Redshift are very fast with minimal ETL. Dremio has been great as a virtual
data mart with great performance on Parquet data in S3 and desktop BI tool
integration.

------
adrianN
I work at Snowflake where we're dogfooding our own stuff.

------
mike_lee_28
We've recently changed from Segment -> BigQuery -> Looker to Segment ->
BigQuery -> Dataform -> BigQuery -> Looker.

The addition of Dataform into our stack has completely changed the game,
allowing us to maintain one source of objective truth and get the most out of
Looker (minimising the joining etc. in LookML, instead pointing it straight at
tables in BigQuery).

This in turn means our downstream data users are happier to use a fast and
reliable Looker instance!

~~~
thenaturalist
Your first comment ever?

Any affiliation with DataForm?

------
huy
Postgres, Pipedrive, Zendesk -> BigQuery, then BI using Holistics.io

Holistics handles ETL, transformation and self-service visualization (a Looker
alternative), all in one tool.

------
sixo
Custom pipes mostly coordinated by Luigi (heavily customized) -> Redshift ->
DBT -> Looker.

Some spark mixed in at the ingestion and transformation phases.

Like someone else said in this thread, we're currently battling Redshift
scaling problems and are looking to offload more of the raw data to S3 using
Spark to create read views.

No data catalog right now but the Looker setup is extremely well-maintained.
Hoping to deploy Lyft's Amundsen to fix this more generally.

~~~
buremba
Since your Looker setup is well-maintained, did you consider adopting the data
catalog in Looker? What would be the challenge in doing that?

~~~
sixo
The main reason is that only a subset of all our datasets are actually exposed
in Looker, and Amundsen would have applicability to other data sources -
certain production semi-OLAP DBs, the future hive-on-S3 deployment... that we
have no plans to expose in Looker at all.

That said I haven't looked closely at how far we could go with Looker, I'll
check it out.

~~~
buremba
I see. We actually develop a BI tool similar to Looker but also want to
implement a data catalog feature embedded into it. I'm familiar with Amundsen,
but I believe that this kind of tool needs to have tight integration with the
BI tool, and Amundsen still has a way to go in terms of integrations
(searching among saved reports, dashboards, collaboration with teams, etc.).
Otherwise, using spreadsheets is still easier and more maintainable.

I would love to hear the features that you would need in a data catalog
system.

------
thingsilearned
Dave from Chartio here, wanting to share our new book describing the 4 stages
of setting up your ideal data stack here - [https://dataschool.com/data-
governance/](https://dataschool.com/data-governance/).

It covers BI a bit, but mostly the stack that BI sits on top of. It's an open
book so we're always looking for suggestions and experiences such as those
shared here.

------
TraceOn37
We built a tool to help non-technical folk transform their data into useful
formats for BI ([https://www.data-monkey.com](https://www.data-monkey.com)).
The tool currently supports JSON, CSV, Excel and text files, and comes with
features to merge/filter/transform data. It's free to try and use if you're
interested - we'd love to hear your feedback!

------
sphix0r
We built our own tool at Datastreams
([https://datastreams.io](https://datastreams.io)) to collect data. We
currently collect several thousands of events per second, mainly web data.

Events are mainly streamed to one of the following: cloud buckets (S3, etc.),
HDFS, SQL DBs or Cassandra.

Most clients use one of the following visualization tools: PowerBI, Qlik or
Tableau.

Our clients are mid to enterprise size.

Disclaimer: I work at Datastreams

------
chrisacky
Is there any way that we can provide BI to our customers using an OSS tool?

Our databases store all our users' data. I'm thinking of using something like
pgsync to sync our production Postgres databases to a new Postgres instance
and then setting up Redash or Metabase to connect to it.

Alternatively, using locked filters on Metabase and embedded questions.

All our users' data is in Postgres and they very much want BI insights. Not
sure of the easiest way to do this...

~~~
seektable
For this purpose you can check the 'Team Sharing' feature of our BI tool
(SeekTable): [https://www.seektable.com/help/share-report-to-team](https://www.seektable.com/help/share-report-to-team)

------
timosch424
Ideally: Podio, Excel sheets, other program backends -> Pentaho ->
MySQL/MongoDB on Google Cloud, Google Analytics -> R scripting (DBI
connections, BigQuery and Hadoop managed through R) -> shiny-proxy, APIs,
Hadoop for big data computation -> RMarkdown reports, Shiny dashboards, a
little Tableau and a little Power BI.

MySQL will be moved to Postgres for better performance soon.

------
subhajeet2107
Custom Collector (analytics) -> Clickhouse -> Custom ETL Scripts -> Clickhouse
-> Re:dash

We tried Metabase, which is awesome, but Redash is also great and easy to set
up; if your team knows SQL then Redash is better. We also looked at Druid, and
after some benchmarking we settled on Clickhouse; realtime queries, even
without ETL, run within seconds in Clickhouse.

------
rmk2
We are a small data consultancy, so we use other/more diverse things for
customers, but our internal stack is fairly simple:

Hubspot/Jira/G Suite → (Python) → PostgreSQL → (SQL) → Tableau

Since we are Tableau partners, we have a bunch of internal licences either
way. We host Tableau Server, ETL, and PostgreSQL ourselves, all on CentOS
cloud servers.

------
derekmcloughlin
{SQL Server, Oracle} -> ETL to SQL Server (Hybrid graph db)-> SQL Server
Analysis Services -> {Power Bi, Tableau}

~~~
mipmap04
SSAS is my favorite cube / OLAP data store. I know I'm in the minority, but I
find it really user friendly for devs and users.

------
Dramatize
Product Manager at Replica here.

I've set up the following stack:

Segment -> (Postgres DB, Intercom, Heap Analytics, FullStory)

Postgres = data dump for future usage. Intercom = CMS + communication. Heap
Analytics = product analytics. FullStory = session analytics.

[https://replicastudios.com/](https://replicastudios.com/)

------
wrs
Various -> Airflow -> BigQuery -> Looker

We have a variety of data sources, from Mixpanel to PostgreSQL to black-box
systems that can only dump out CSV files. We use Airflow to manage pulling
from those, doing some transforms, and writing everything into BigQuery.
Looker then handles the reporting and browsing.
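
A skeletal version of one of those DAGs, to show the shape (task names and
callables are hypothetical; imports per Airflow 2.x):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def pull_source(**context):
        ...  # hit the source API / pick up the CSV dump, stage raw files

    def load_bigquery(**context):
        ...  # transform the staged files and load them into BigQuery

    with DAG(
        dag_id="warehouse_daily",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        pull = PythonOperator(task_id="pull_source",
                              python_callable=pull_source)
        load = PythonOperator(task_id="load_bigquery",
                              python_callable=load_bigquery)
        pull >> load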

------
triiimit
ETL: Python & SQL

Warehouse: PostgreSQL

Reporting platform: Looker

Easy, agile, and cheap.

------
eibhinn
Oracle -> IBM Cognos

A nightly rebuild using ETL scripts written in SQL. Not cheap or glamorous,
but solid for our needs.

------
jeffnappi
Data Sources (Analytics, DBs etc) -> Stitch (stitchdata.com) -> Redshift ->
Periscope (periscopedata.com)

This setup has worked pretty well for us so far. I've learned of a few tools
from this thread that might help us to better manage data sets and views -
specifically DBT and Dataform.

------
wladow
@sixo @numlocked I'd be happy to share more about Snowflake's architecture. As
others mentioned in this thread, Snowflake completely separates storage from
compute -- eliminating Redshift scaling issues.

Drop me a message, would love to chat.

william.ladow@snowflake.com

------
enra
We use Retool.com with our Postgres & other data sources.

The benefit is that we can also build tooling and workflows, in addition to
the metrics, tables and charts.

Early on you don't necessarily know exactly what you'll end up needing, so a
malleable tool is useful.

------
bass_case
Are any of you offering embedded analytics in your products? Mostly applies to
mid-market/enterprise SaaS platforms but interested in learning more about how
you offer analytics/BI to your customers in your products.

~~~
aceregen
Hi! You may like to check out what our team has built at www.holistics.io. The
pricing of our embedded dashboards lets you scale to unlimited viewers, and
you don't need to recreate a duplicate copy of your user accounts in our
system for it to work. Check out the video below to find out more!

[https://www.holistics.io/guides/embed-analytics/](https://www.holistics.io/guides/embed-analytics/)

------
importantbrian
We use the standard Microsoft stuff. SQL Server, SSIS, SSRS, PowerBI. There
are some pain points, but for the most part, it works, and it's pretty
inexpensive if you're already a Microsoft shop.

------
in9
No mention of Athena? Right now we are heavily using it at our org.

------
sunasra
At Qubole, I have this setup for internal reporting

MySQL -> Data export using Sqoop through Airflow -> S3 -> Spark -> Jupyter
Notebook

PS: Qubole is a data platform which makes ETL pipeline setup easy.

------
throwaway49409
How does one end up doing BI / data engineering? Currently a frontend /
full-stack JS dev, tired of that world and wanting to switch to something
stable.

~~~
vitorbaptistaa
I'd highly recommend the Kimball books, e.g. "The Data Warehouse Toolkit".

Unless you're working with a lot of data, the technical parts of a data
warehouse are relatively easy to grasp (the dbt team wrote some great posts on
how they structure their projects [1]). After that, what really makes a
difference is how you structure your data warehouse: how you model your data
in a way that allows you to query it efficiently and ask many questions.
That's where the Kimball books shine.

[1] [https://community.snowflake.com/s/article/Use-Case-How-We-Configure-Snowflake-at-Fishtown-Analytics](https://community.snowflake.com/s/article/Use-Case-How-We-Configure-Snowflake-at-Fishtown-Analytics)

------
sdpurtill
Datacoral, which is our ETL and manages all materialized views within our
warehouse, replacing our need for data engineers

Redshift

Mode & Looker

------
spullara
Kinesis Firehose -> S3 -> Snowflake -> Sigma

------
flowerlad
MySQL and Pebble Reports [1] for reporting.

[1] [http://pebblereports.com/](http://pebblereports.com/)

------
robbiemitchell
Infra: Postgres -> S3 -> Looker

Business: Segment -> customer.io/Zapier/Heap Analytics + Looker

Support/Success: Intercom+Slack -> frame.ai

~~~
andrethegiant
+1 for Heap – pretty simple to segment, query, and organize data in my
experience.

------
EdwardDiego
Kafka streaming to validate and attach model data, fed into Druid by Flink,
queried via a custom front end.

~~~
chrisjc
> and attach model data

Can you elaborate on this? Do you mean enriching events and/or
windowing/aggregating/summarizing?

Are you using KSQL?

~~~
EdwardDiego
Yeah, enriching the events as they pass through - so we have a Kafka Connect
JDBC source polling the source of truth DB every X seconds for model changes
(we looked into Debezium for streaming model changes, but the additional
complexity didn't bring us much benefit based on how often our model changes),
and then we pull that into a GlobalKTable and join the KStream as appropriate.

We also use Kafka streaming with a persisted event window store to deduplicate
across a window of 2 hours.

~~~
chrisjc
Mind if I ask you what you used to capture source changes if Debezium didn't
work out for you? Is there a specific Kafka Connector for JDBC that you're
using?

~~~
EdwardDiego
Nope, just the JDBC Connector configured for Postgres. It wasn't that Debezium
didn't work out for us, more that it brought in more complexity (e.g.,
[https://debezium.io/documentation/reference/1.0/postgres-plugins.html](https://debezium.io/documentation/reference/1.0/postgres-plugins.html))
than the payoff justified.

If we had a need for near-instantaneous model updates, then I'd definitely go
Debezium. We didn't use it because we didn't need it - a generic KC JDBC
source with a poll period of 5 seconds met our needs.

------
Tharkun
We basically don't have a BI stack. Which is silly, but I guess we've had
other priorities.

~~~
crustacean
Better to have “nothing but prod” than something bad, imo/tbh.

------
8589934591
Client I'm working for:

* SAP -> SAP services -> Tableau.

* Some depts use Excel -> Python -> Tableau.

------
kakoni
Embulk + Airflow => Postgres => Apache Superset (as the UI/tool)

An ELT process, so more DBT in the future.

------
user7878
HubSpot -> AzureSQL -> PowerBI

HubSpot -> AzureSQL -> Tableau

Fully automated syncing with user friendly signup

------
valcker
Data warehouse: Postgres

Data visualizations and reporting: Tableau, DataGrip

------
arkiver
We use a simple `Postgres -> CSV -> Pandas` pipeline.
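
End to end that can be as little as this (connection string, query and column
names are placeholders):

    import pandas as pd
    import psycopg2

    # Server-side COPY is the fast way to get Postgres data out as CSV.
    conn = psycopg2.connect("dbname=app user=report")
    with conn.cursor() as cur, open("daily_orders.csv", "w") as f:
        cur.copy_expert(
            "COPY (SELECT * FROM orders) TO STDOUT WITH CSV HEADER", f)
    conn.close()

    # Analyze the extract in pandas.
    df = pd.read_csv("daily_orders.csv", parse_dates=["created_at"])
    print(df.groupby(df["created_at"].dt.date)["total"].sum())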

------
tylerjaywood
ETL: Domo data connectors, Python, SQL

Warehouse: PostgreSQL

Reporting platform: Domo

------
roystonvassey
Apps on Pivotal to extract, transform and combine data -> MS SQL DB -> Power
BI to visualize.

------
eeZah7Ux
BI?

~~~
jarjarbinks455
Business Intelligence. Making reports, usually over a large set of data. A
normal query would be too slow, so the results are pre-calculated. Maybe you
calculate the average and total sales per city, then dump the results into a
table. From there you can quickly generate higher-level reports like sales
per state.
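
A toy version of that pre-aggregation in pandas, just to make the idea
concrete:

    import pandas as pd

    sales = pd.DataFrame({
        "state":  ["CA", "CA", "CA", "NY"],
        "city":   ["SF", "SF", "LA", "NYC"],
        "amount": [100.0, 50.0, 200.0, 75.0],
    })

    # Pre-compute the expensive pass over raw rows once, per city...
    per_city = sales.groupby(["state", "city"])["amount"].agg(["sum", "mean"])

    # ...then coarser reports (per state) come cheaply from the small
    # rollup table instead of rescanning the raw data.
    per_state = per_city.groupby("state")["sum"].sum()
    print(per_city, per_state, sep="\n\n")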

------
endlessvoid94
Postgres -> QueryClips

------
vitinho_
does anyone use ElasticSearch + Kibana?

~~~
IpV8
I do. I work in data consulting. I prefer tableau on top of snowflake for 99%
of jobs, but for massively scaling insert performance elasticsearch has a
niche. Kibana seems like a mishmash of different products duct taped together,
but for the most part plays nice.

