
Launch HN: Hubble (YC S20) – Monitor data quality inside data warehouses - oliver101
Hey everyone! We’re Oliver and Hamzah from Hubble (https://gethubble.io/hn). Hubble runs tests on your data warehouse so you can identify issues with data quality. You can test for things like missing values, uniqueness of data, or how frequently data is added/updated.

We worked together for the last 4 years at a startup where we built and managed data products for insurers and banks. A common pattern we saw was teams taking data from their internal tools (CRM, HR system, etc.), application databases, and 3rd-party data and storing it in a warehouse for analysis. However, when analysts/data scientists used the data for reports they would spot something suspicious, and the engineering team would have to manually go through the data pipelines to find the source of the problem. More often than not it was simple things like a spike in missing values because an ETL job failed, or stale data because a 3rd-party data source hadn’t updated correctly. We realised that the reliability/trustworthiness of the raw data was essential before you could start abstracting away more interesting tasks like analysis, insight, or predictions.

We wanted to do this without having to write and maintain lots of individual tests in our code. So we built Hubble, which connects to a data warehouse and creates tests based on the type of data being stored (e.g. freshness of timestamps, cardinality of strings, max value of numbers, missing values, etc.). We’ve also added the ability to write custom tests using a built-in SQL editor. All the tests run on a schedule and you’ll get an email or Slack alert when they fail. We’re also building webhooks and an Airflow operator so you can run tests immediately after running an ETL job or trigger a process to fix a failing test.

Instead of asking users to send their data to us, the tests are run in the data warehouse and we track the test results over time. Today we support BigQuery, Snowflake and Rockset (which lets us work with MongoDB and DynamoDB), and we're adding more on request.

We’re planning on charging $200 a month for a few seats, and $30-50 for extra users after that.

We’re still at an early-access stage but want the HN community’s feedback, so we’ve opened up access to the app for a few days; you can try it out here: https://gethubble.io/hn. We’ve added a demo data warehouse you can start with that has data on COVID-19 cases in Italy and bike-share trips in San Francisco. Thanks, and looking forward to hearing your ideas, experiences and feedback!
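
To give a flavour, a custom test in the SQL editor might look something like this: a uniqueness check that fails whenever it returns rows (a minimal sketch; the table and column names are invented for illustration):

    -- Hypothetical custom test: fail if any order_id appears more than once.
    SELECT order_id, COUNT(*) AS duplicates
    FROM analytics.orders
    GROUP BY order_id
    HAVING COUNT(*) > 1;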
======
jeremynevans
Customer here (comment not solicited!). We've been trying out Hubble for a
month or so and it's looking really promising.

I love the idea of being able to outsource the creativity/problem solving of
predicting things that could go wrong with our data to a service that
specialises in just that, and I can totally see how they can automate this in
a big way as they grow.

------
verhey
How does Hubble compare to Great Expectations or DBT for pipeline testing? It
looks like there's more emphasis on automated profiling than "having to write
and maintain lots of individual tests", and obviously Hubble being a SaaS
offering is the big difference?

Also, any plans to profile and test file-based stores as well? There's a lot
that can go wrong in a pipeline before data even reaches BigQuery or
Snowflake, and you might help your customers save money if you could profile
data in S3 before it goes through a potentially expensive transform process.

Best of luck, though! Data testing is a very real need in most data
organizations I've been in, and I'm glad more and more tools seem to be
popping up recently to help with it.

~~~
oliver101
Thanks! We love DBT and take a lot of inspiration from their work. We’re
putting a lot of effort into suggesting the right tests based on the data
types, sources, and field names. A lot of these tests are pretty repetitive to
write so we want to make it easy to spin them up.

We’ve also found that keeping a history of the state of the warehouse over
time is really useful context for determining whether a test has failed
(example: this table tends to update every 30-40 minutes so we’ll set a
threshold at an hour).
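
A minimal sketch of what that kind of freshness test boils down to, in
BigQuery syntax (the table and timestamp column are made-up names):

    -- Hypothetical freshness test: flag the table as stale if the newest row
    -- is more than an hour old (threshold inferred from its update history).
    SELECT MAX(updated_at) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 MINUTE) AS is_stale
    FROM analytics.orders;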

We also handle the scheduling, which is surprisingly annoying to manage (we
built a couple of internal tools for this in the past). That's something we
really missed with Great Expectations (you get it with DBT Cloud). Testing
files is an interesting use case; to an extent we support this using Athena or
BigQuery external tables for JSON/CSV/Parquet. We're intentionally limiting it
to SQL for now.
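
For instance, a rough Athena sketch (the bucket, table, and columns are
assumptions for illustration, not something we ship): define an external table
over files in S3, then point ordinary SQL tests at it.

    -- Hypothetical Athena external table over Parquet files in S3.
    CREATE EXTERNAL TABLE raw_events (
      event_id STRING,
      created_at TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events/';

    -- A missing-values check now runs directly against the files.
    SELECT COUNT(*) AS null_ids FROM raw_events WHERE event_id IS NULL;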

~~~
sails
Very interesting tool, I am trying to do this with Dataform/Looker, and feel
like some kind of inference like below would be great.

> this table tends to update every 30-40 minutes so we’ll set a threshold at
> an hour

Can you achieve these tests with metadata or do you need 100% read access to
the database?

I also wonder if this would work as part of an Analytics Engineering CI/CD
process? Something like how dbt Cloud will block pull requests that fail
certain criteria.

~~~
oliver101
Metadata is a valuable place for finding information like load times and rows
inserted/updated. Currently we just rely on read access and raw SQL. A
common way users are doing this now (and we are internally for our analytics
data) is using, for example, the Fivetran logs table to monitor ingestion
times and inserted rows, rather than querying the raw tables.
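
Roughly, that pattern looks like this (BigQuery syntax; the log table and
columns here are illustrative, not Fivetran's actual schema):

    -- Hypothetical check on an ingestion log table: alert if the last
    -- successful sync finished more than two hours ago.
    SELECT MAX(finished_at) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR) AS ingestion_is_stale
    FROM fivetran_log.sync_events
    WHERE status = 'SUCCESS';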

For CI/CD, absolutely, we want to support this as well as stopping/conditional
execution in DAGs (e.g. Airflow). We're launching webhooks very soon.

------
mushufasa
This is interesting! Running tests on data is certainly a pain point for me,
and there doesn't seem to be nearly the kind of infrastructure available as
there is for, say, testing code functionality.

Is this open source? Sending my data to a third party is a no-go, as is having
a third-party connect to the database. Something part of a managed hosting
service, though, or an add-on to an existing trusted hosted service that has
gone through compliance (e.g. Heroku, AWS), would be more palatable.

~~~
hamzahc
This was the same pain point we had when we saw how good the tools were for
testing our software vs our data.

It's not open source, but we can deploy on-prem (or cloud-prem, more accurately)
pretty easily. We're also going to set up as an add-on available through the AWS
Marketplace. Feel free to shoot me an email if you want to see if this can
work for you: hamzah[at]gethubble.io

------
LittlePeter
Running a full table scan on BigQuery every hour can get quite expensive. Do
you support some sort of deltas?

I signed up. Unlike the video, I do not see Redshift as an option. Any idea
when Redshift will be supported?

How does billing per user make sense here? What prevents me from monitoring
thousands of tables under a single user? Your workload costs will be higher
than $200 here, no?

Do you have a set of fixed IPs you're connecting from to allow me to whitelist
you?

~~~
oliver101
Full table scans can get expensive. We're adding support for incremental tests,
so for append-only tables you'll only test the recent rows. This is especially
useful if you use partitioned tables in BigQuery.
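
For example, on an ingestion-time-partitioned BigQuery table a test can be
scoped to just today's partition, so only the newest rows get scanned (the
table and column names are illustrative):

    -- Hypothetical incremental test: count missing IDs in today's partition only.
    SELECT COUNTIF(user_id IS NULL) AS null_user_ids
    FROM dataset.events
    WHERE _PARTITIONDATE = CURRENT_DATE();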

Actually in the first version of the product we automatically tested every
column in every table. The tests are more selective now, which is partially
due to cost and partially because nobody wants to navigate through 10,000
tests.

Redshift will be supported this week! We have a list of new sources to get
through and it’s right at the top. We’ve been emailing over the IP for
whitelisting but we’ll add it to the connection page too.

As for pricing, we’re experimenting. Our costs do scale with number of tests
(more scheduled tasks, more historical results stored). At the moment we
retain the last month or so of test results, which is manageable for pretty
large workloads.

~~~
LittlePeter
Looking forward to Redshift!

BTW, you don't need to navigate 10K tests... you only need to navigate the
failing ones.

------
scapecast
Co-founder of intermix.io here (which we sold in March). We came more from the
performance monitoring angle (specifically for Redshift), but then shifted to
a product that works horizontally across all warehouses, to track usage,
workflows and user engagement. "Shift to Data Products" was the narrative we
started using in Q4 2019. If you read the copy on the current intermix.io
website, I think you'll find yourself nodding. (FYI - we got bought by a small
PE Fund that is rolling the product into Xplenty, an ETL product).

My experience is that monitoring data quality is still an under-appreciated
discipline. I've found that most teams still have a "not invented here"
mentality, or don't even know they have the problem! That can lead to an "oh,
we can just fix it when it happens" attitude. But your timing may be
better than ours - we started back in 2016.

I haven't played with your product (yet), only took a look at this thread and
your website. Some observations:

- SQL Editor - big plus! I think giving your users a space where they can
take action is a super value-add, we didn't have that.

- Nice work running the tests inside the customer's warehouse. That has two
benefits for you: 1) you're not incurring the cost to crunch the metadata,
which can get quite expensive depending on the number of tables in the warehouse;
2) you're avoiding data access issues; getting access to the warehouse was
always a hurdle, even though we only needed access to the system tables.

- Pricing model. I think the per-seat model is the way to go. We tried
charging by number of rows and by size of the warehouse (number of nodes), but
then you run into weird situations with customers who are dealing with huge
historic datasets but really only look at the last 30 days of data.

My unsolicited $0.02 is that you think hard about distribution. I think you
want to think about hitching your wagon to the cloud marketplaces, and
Snowflake's marketplace. For example, attaching themselves to Snowflake is
what made all the difference for Fivetran.

I have a bunch more scars that I can share if you care to know them :-)

~~~
texasbigdata
Fantastic blog post, thanks for sharing.

So I guess, if you had to pick arbitrary revenue/data/FTE cutoffs, do you see
the org charts of these adopters, as you've described them, looking a certain
way? Let me try to rephrase that.

Do you think there’s a step function of “here you need one DBA who is a holy
librarian” and “here we need a gitlab styled data team with SLAs and the data
equivalent of HR business partners who get assigned to the BU”?

Tangential to your comment but curious if you believe the human side scales
akin to the infrastructure side.

~~~
sails
Where is the blog post?

------
_Microft
Have you considered picking a different name? Searching for "Hubble" for
whatever reason is going to return millions of irrelevant results for your
customers.

~~~
anticsapp
I can't think of a worse name for SEO purposes. You'd have to fight through a
well-loved and well-known space telescope, the astronomer it was named after,
and Hubble contact lenses, which has raised ~$74MM.

~~~
switz
If a customer is looking for you specifically, they will find you (e.g.
"hubble data" as stated above). If they are looking for a "data quality
monitor" then the SEO will need to reflect that. The name is largely
irrelevant at that point, it's merely a moniker.

In the grand scheme of problems a new company has, this is so trivially minor
that I can't fathom this having any tangible effect on the success of a
company. It's one thing if there's another data warehousing company called
"hubble", but that's not the case you're making.

~~~
Kye
"Hubble data" brings up, as I would expect, data from the Hubble Space
Telescope. Not one of the first page of results points to anything but
HST information.

~~~
switz
The product literally just launched -- give it a few weeks, it'll show up.

~~~
Kye
I don't know who's advising you on SEO, but you will not ever outrank STScI,
NASA, ESA, AWS Open Data's HST archive, The Planetary Society, the National
Academy of Sciences, or the ESO on "hubble data" as long as the telescope is
still what people think of when they hear "Hubble". The telescope and related
sites/agencies/organizations have a 22-year head start building a relevant
link profile in Google. And if you did, Google would get suspicious.

Hubble is fine as a name if you pick the right keywords to target in your
marketing, but "Hubble data" is never going to show a link to something that
isn't at least tangentially related to the telescope.

~~~
gk1
You're getting carried away with the example "hubble data." The point is that
people will modify their search terms until they find the company they're
looking for. If they don't know the company they're looking for, then they will
search by use case (e.g., "detect data drift"), in which case the search results
for the company name don't matter.

------
hribo
I signed up and I think the concept is promising. It was very easy to add a
couple of tests. The SQL interface is handy and convenient, but sometimes still
limited. It would be good to add support for custom scripts (e.g.
Python, R). Another important thing for my team would be seamless
integration with other tools (e.g. email, SMS, Slack) to notify the team about
failed tests.

------
12ian34
+1 for relieving data scientists/engineers of boring, repetitive manual
tasks and empowering them to focus on the more challenging stuff

------
iblaine
What does the tech stack look like?

Is there any caching for those situations where you may read the same
historical data over & over?

~~~
oliver101
Yes, we store the historical value of each test so you can always scroll back
through time and see the state of the data warehouse at any given point.

For example, if you have a test that counts the number of rows (COUNT(*)),
that value will be recorded. So you can look back an hour/day/week and see how
many rows the table had without executing any SQL. These values are stored in
a time-series DB, so querying history is fast.

Our tech stack: a monolithic backend in Python + Postgres + React. The tests
themselves are all SQL queries and run in the data warehouse.

------
hg_
Do you have/think you need an on-prem version?

~~~
oliver101
Yes we can run the whole stack on-prem. We realised very early that on-prem
would be needed for many users. So we've made it easy to spin up Hubble in a
k8s cluster in your cloud or on bare metal.

