Launch HN: Elementary (YC W22) – Open-source data observability
122 points by Maayansa on March 4, 2022 | 30 comments
Hey HN! We’re Maayan and Or, and we are building Elementary (https://github.com/elementary-data/elementary), an open-source framework that continuously monitors your data and sends alerts when anomalies are detected.

Elementary can alert you, for example, when a table in Snowflake hasn't been updated as expected or when a revenue column has too many nulls. It also monitors operations in the data stack and supports both impact and root-cause analysis. For example, it can alert you when your dbt runs or tests fail, including the impacted dependencies. A data lineage graph visualizes the data flow and can be used to trace invalid data back to its source.

We have both been working in the data space for over a decade, Maayan in analytics and Or in data engineering. Despite working at very different companies with different data stacks and use cases, we had the same reliability problems. Data is incomplete and inconsistent, and the abundance of technologies creates more complexity and inconsistency. Data reliability issues cause distrust, delays, and bad decisions. It's hard to achieve high data reliability, detect issues fast, understand their impact, and resolve them quickly.

We also found that we had built similar solutions, and as we talked to other developers, we learned that most data teams have their own version of the same thing. They usually don’t go for a commercial observability solution unless they have major incidents and technical debt. Until that point, they prefer to build for themselves, for two reasons: to avoid the overhead of procurement and security compliance; and to customize to their own stack, data sources, business logic, etc, and have all the metadata and metrics in their stack to support additional use cases.

We decided to build an open-source alternative—one that can be implemented easily, hosted yourself, and customized. This solves the compliance and the data ownership problem. It also solves the build-your-own problem, because teams can deploy an extensive solution early on, instead of waiting till later when there are major problems.

Elementary stores all the logs, metadata, and metrics it collects and generates in the data warehouse, so users can easily add their own detections and logic to it. Additionally, the solution is dbt native, which means it provides a dbt package that can be easily installed in a dbt environment as well as configured directly from a dbt project. Since it's part of the existing workflow and environment, it makes it convenient for data engineers, analytics engineers, and data analysts to enhance and contribute.
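
To illustrate, here's a rough sketch of the kind of custom detection you could layer on top of the metrics stored in your warehouse (the connection details, table, and column names below are just an example, not Elementary's actual schema):

import pandas as pd
from sqlalchemy import create_engine

# Example connection; swap in your own warehouse driver and credentials.
engine = create_engine("snowflake://<user>:<password>@<account>/<database>/<schema>")

# Hypothetical metrics table and columns, purely for illustration.
query = """
    select table_name, column_name, metric_name, metric_value, bucket_end
    from data_monitoring_metrics
    where metric_name = 'null_count'
      and bucket_end >= dateadd(day, -7, current_date)
"""
metrics = pd.read_sql(query, engine)

# Custom rule on top of the stored metrics: flag columns with too many nulls.
flagged = metrics[metrics["metric_value"] > 1000]
for _, row in flagged.iterrows():
    print(f"High null count in {row['table_name']}.{row['column_name']}: "
          f"{row['metric_value']} ({row['bucket_end']})")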

Open source eliminates the need to pay for getting started or to grant access to a third party. A managed service and additional enterprise features will be available in the cloud in the future. Critically, though, the user's metadata will continue to reside in their environment, under their control, and they will still have full customization available.

Currently Elementary supports Snowflake, BigQuery and dbt. It collects metadata such as schemas, query logs and dbt artifacts. It generates and monitors data quality metrics, sends Slack alerts, and visualizes the data lineage. If this is your data stack, we’d love you to give it a try!

We would love to hear your feedback, experiences and ideas from trying to solve data observability in your organizations.




Sparsity and non-linearities in the data need to be factored in, as do data aging and maturity; how anomalies are detected should depend on these. I have seen consistent alert floods that in reality were just temporal fluctuations or temporal correlations between variables. Measurement and sampling time in time series across touch points is another important design decision.


Congrats on the launch! As a former data scientist, I suffered from bad data on a daily basis. Can you provide some details on how anomalies are detected? Is it a threshold-based approach defined by the user, or are you running statistical analysis on the user's data? Curious to learn more!


Anomalies are detected based on a statistical analysis of the data and are measured in terms of standard deviations from the mean. We have lots of plans for improvements in the future, and we're curious to learn how you would approach this problem as well.
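
To make that a bit more concrete, here's a simplified sketch of that kind of z-score check (illustrative only, not our actual dbt models):

import numpy as np

def detect_anomalies(values, z_threshold=3.0):
    # Flag points more than z_threshold standard deviations from the mean.
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0:
        return []  # constant series, nothing to flag
    z_scores = (values - mean) / std
    return [(i, v, z) for i, (v, z) in enumerate(zip(values, z_scores))
            if abs(z) > z_threshold]

# Example: daily row counts with one suspicious drop.
row_counts = [1020, 998, 1005, 1012, 990, 120, 1003]
for idx, value, z in detect_anomalies(row_counts, z_threshold=2.0):
    print(f"Anomaly at index {idx}: value={value}, z-score={z:.2f}")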


The pain for sure is real. The way we approach it: Whenever there is a data change we re-run our models and see if it improves performance or some other metrics (which are very specific to us). I wonder how valuable "generic" metrics are.


Is this something similar to Metaplane, also YC, that recently launched?

https://news.ycombinator.com/item?id=29226864


Hey Maayan and Or, nice project. At re_data we just went over a lot of your new updates, and it seems quite a large part of your project is "inspired" by code from our library https://github.com/re-data/re-data. Even parts we are not especially proud of ;)

If you decide to copy not only ideas but a big part of the internal implementation, I think you should include that information in your LICENSE.

Cheers


Is the idea here that it's inspired by re_data because it uses dbt transformations underneath, or because the repos look nearly the same? (or both?)

Looks like much of the lineage code is also largely a wrapper around this library: https://github.com/reata/sqllineage

Would be curious to understand the project's purpose and unique contributions vs. the underlying dependencies powering it as there seems to be some ambiguity. Is this just a wrapper around dbt transformations and a lineage library in one package? Can I just use them directly?


The dbt transformation part is "inspired" in that it uses the same models and the same logic/parts of the code that generate them. We, for example, had a funny thing of computing metrics in 4 threads via multiple dbt models, and this is also done in Elementary in a very similar way :)

The lineage part is independent (re_data uses lineage from dbt), so I haven't looked into that much.


While developing Elementary we looked into more than 60 dbt projects to learn from prior work, and have been inspired by different things in different places. You're right that we were inspired by a couple of techniques you used, one being that creative way to improve performance (though the 4-thread setting itself is the dbt recommendation in their docs). Another is using z-scores for anomaly detection, which we saw in a number of related projects and which is widely used in the industry.

In terms of lineage, you can see in the code that we mostly rely on the query and access history that exist in Snowflake and BigQuery to parse the queries and learn about the connections between nodes in the graph. We use other Python libraries like sqlfluff and sqllineage as low-level parsers for some specific use cases, and we extend them and solve many things on top of them. Actually we're heavy open-source users, depending on around 20 libraries, all MIT or Apache licensed.
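
To give a sense of what that low-level parsing layer does, here's a minimal example of using sqllineage to pull source and target tables out of a query (simplified; our code handles many cases on top of this):

from sqllineage.runner import LineageRunner

sql = """
insert into analytics.daily_revenue
select order_date, sum(amount) as revenue
from raw.orders
group by order_date
"""

runner = LineageRunner(sql)
print("sources:", runner.source_tables())  # e.g. [Table: raw.orders]
print("targets:", runner.target_tables())  # e.g. [Table: analytics.daily_revenue]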


Okay, I'm happy that you admit inspiration in this comment (as opposed to the previously deleted one).

Also, I think it's more than just following re_data in a couple of places. Elementary's whole data monitoring part started much later than your lineage part, and it seems to try to follow what re_data did there at the idea & implementation level. I'm sure the other 59 projects you mention were not dbt packages for data reliability (there was no other one on dbt Hub), which is what re_data is and what Elementary now also tries to copy (seeing our traction).

As mentioned, it's open-source. You can use our code. But if you are doing that, state that clearly in the LICENSE.


I think mateuszklimek is pointing out that the MIT license requires you to include the re_data copyright in your source.


Right on point, they don't even have a filled out LICENSE on the repo

> Copyright [yyyy] [name of copyright owner]

https://github.com/elementary-data/elementary/blob/master/LI...


Gotcha - I can see what you mean, appreciate the clarification


If you're going to make an accusation like this on HN, you should provide line-by-line evidence. Saying "you copied us" without any examples makes you not credible.


Sure, please compare https://re-data.github.io/dbt-re-data/#!/overview?g_v=1 with the graph PNG on https://docs.elementary-data.com/.

Elementary models like data_monitors_thread1, data_monitors_thread2, data_monitors_thread3, data_monitors_thread4, data_monitoring_metrics, latest_metrics, metrics_stats_for_anomalies, z_score, anomaly_detection, schema_changes, etc. existed before in re_data and do the same things, and *_thread4 in particular is not similar to anything you would normally do in dbt.

And these similarities are also visible in the code, for example the same usage of the undocumented dbt context feature here:

# elementary

{% macro get_monitor_macro(monitor) %}

    {%- set macro_name = monitor + '_monitor' -%}
    {%- if context['elementary'].get(macro_name) -%}
        {%- set monitor_macro = context['elementary'][macro_name] -%}
    {%- else -%}
        {%- set monitor_macro = context['elementary']['no_monitor'] -%}
    {%- endif -%}

    {{- return(monitor_macro) -}}
{% endmacro %}

# re_data

{%- macro get_metric_macro(metric_name) %}

    {% set macro_name = 're_data_metric' + '_' + metric_name %}

    {% if context['re_data'].get(macro_name) %}
        {% set metric_macro = context['re_data'][macro_name] %}
    {%- else %}
        {% set metric_macro = context[project_name][macro_name] %}
    {% endif %}

    {{ return (metric_macro) }}
{% endmacro %}


Pretty strong accusation, are you sure re-data isn't "inspired" by Monte Carlo? :)


It is! But it doesn't have Monte Carlo code in it :)

And it's open-source so it's generally okay to do that, but it should be reflected in the LICENSE.


For any dbt users, their reliability package has the best and most comprehensive way to upload artifacts directly to the warehouse after a dbt invocation.

https://github.com/elementary-data/dbt-data-reliability


Thank you! We believe this upload is super valuable and could unlock a lot of additional use cases. We are already working on some of these and will release them in the next few weeks.


I really like that this stores all data in the data warehouse, unlike Monte Carlo.


This sounds very cool. Looking forward to trying it out on our data layer.


Thank you, would love to get your feedback and thoughts on what we should add.


I'm disappointed to see there isn't Redshift support. What's on the roadmap to address that?


Hi, it's definitely in our plans for the next few weeks. As I mentioned in the post, we leverage dbt for the whole data monitoring layer. We wrote the package with all the cross-db best practices, but didn't test it on Redshift, so there are probably some gaps. A few users in our Slack community are Redshift users and will help us test in production with real data. Hopefully it won't require much to make it run smoothly.


Great job Elementary team! Is this in essence similar to the AWS Deequ project, but fancier and more inclusive of edge cases and common scenarios? (https://github.com/awslabs/deequ)


Hi, thank you! The way we see it, AWS Deequ, as well as Great Expectations and dbt tests, are used for static data testing. This is great for many use cases; however, there are problems you will only detect by continuously monitoring. Just like in software engineering, you use both unit testing and monitoring.


This is awesome - much needed, and your OSS starting point will be really attractive. Will give it a go!


Thanks! We would really love to hear your feedback, or let us know if we can help somehow!


What's with this comment section? There are a lot of dead comments...


Someone sent out a link or something and it led to a bunch of booster comments. That's not allowed on HN, so I killed the comments and emailed the founders asking them to make it stop.

The guidelines say not to do it: https://news.ycombinator.com/newsguidelines.html, the FAQ says not to do it: https://news.ycombinator.com/newsfaq.html, and the Launch HN advice we give to YC startups says to "make sure" (in bold!) not to do it: https://news.ycombinator.com/yli.html. But people who aren't familiar with HN's conventions still end up doing it sometimes.

We make a distinction between innocent mistakes (which are usually obvious) and repeat offenses (where people usually try to cover their tracks). The former isn't a big deal, the latter we ban accounts for.



