Hacker News
Show HN: DaLMatian – Text2sql that works (dalmatian.ai)
44 points by alandu 8 months ago | 28 comments
Hey HN, we've built DaLMatian, a text2sql product that meets the needs of data analysts working with enterprise data. We built this app because as a data analyst at an enterprise I could not find a text2sql product that was (1) actually useful for my day-to-day and (2) easy to set up on my computer. Existing products either fall apart when tested on gnarly enterprise data/queries or require going through a sales/integration process that I wasn't in a position to push for - I just wanted something that I could quickly set up to help make my job easier. Our goal is to make this a reality for any data analyst that feels the same.

There are many constraints that make this reality difficult to achieve. The product needs to scale to databases with millions of columns and extract business logic from very complex queries. It also needs to be fast, at least faster than it would take an analyst to write the query. On top of all this, the app has to pass security review so an analyst is actually allowed to use it. Our app meets all the key requirements of an enterprise data analyst while also being lightweight enough to run locally on a typical laptop.

Here's how it works. To get started, you simply need to open a file of past queries in our IDE (try it here: https://www.dalmatian.ai/download) and add a file with your database schema (instructions here: https://www.dalmatian.ai/docs#configuration). There is also an option to connect a database to auto pull your schema (no actual data is seen by the LLM). We do not see anything you input since the app is local and the only external connection is with OpenAI. It's just like asking ChatGPT for help with queries, but in a streamlined way.
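For a rough sense of the shape, a schema file might look something like this (purely illustrative; the column layout here is my own invention, so see the docs link above for the actual expected format):

```csv
table_name,column_name,data_type,description
orders,order_id,bigint,Primary key
orders,customer_id,bigint,FK to customers.customer_id
orders,order_date,date,Date the order was placed
customers,customer_id,bigint,Primary key
customers,region,varchar,Sales region code
```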

If you download our free IDE and try to break it, we'd love to hear what you come up with!




Two notes:

1) I appreciate that it's said to be local-first, but the fact that it depends on the OpenAI API is... kind of a big hole in that? The organization I work in wouldn't really accept this for approval, and from the title I was hoping that this would be a local-first fine-tuned (or fine-tunable) LLM.

2) The about page stating that you met at Princeton is a huge bear signal for me. I don't think tools should be adopted based on how elite (cognitively or financially or socially or athletically or whatever) their creators are, and given the use of the OpenAI APIs I question why the "top ML conferences" bit is here at all.


The trend of these apps (admittedly, there are worse offenders than these guys) stressing how your data is completely safe, encrypted in transit, not stored on our servers, yours forever... and then, by the way, piping everything straight into OpenAI, is a bit tiring.


Depending how you pay OpenAI you may be covered by their written pinky swear they aren't retaining, training on, or human-reviewing your data and queries.

Stay on top of it, as the shape of the claims keeps shifting as they adapt their business model, such as with the introduction of the Team billing.


Just want to clarify that OpenAI does not train on the query code and schema info we send via API. It's equivalent to using https://chat.openai.com/ with "Improve the model for everyone" (previously "Chat history & training") turned off in Data Controls.


Is it not an improvement over everything being piped to their servers, stored unencrypted, and saved before/after the OpenAI bit?

It's not everything, but there is a reasonable approach where someone would trust OpenAI much more than $startup.


Probably, if those are the two options. But it still feels disingenuous to lean on the privacy angle as hard as many of them do.

Just call it what it is: "We're just a wrapper around GPT-4, so the treatment of your data is subject to OpenAI's privacy policies, and while we'll try to keep you informed of any changes, you should be aware of that major dependency."


1 - yes, our current solution does require you to be allowed to use ChatGPT/OpenAI. Unfortunately, accuracy using smaller models (even GPT-3.5) is poor. We don't see a local model (which would be much worse than GPT-3.5) getting anywhere close to good enough even with fine-tuning, which would also require a really large number of queries. So we are relying on GPT-4 for now.

2 - agreed, the background isn't why anyone should adopt a tool; we just wanted to share our story. I would add that creating a good wrapper can actually be quite challenging: you need to synthesize many pieces under constraints like memory, compute, speed, and accuracy.


In AI/ML research, text-to-SQL always sounded to me like a problem of merely academic interest, in the sense that the outputs are easily verifiable and make for a good proof of concept of a language model's (or a translation model's) capabilities.

But looks like there are plenty of products coming out in this area, and it has me wondering: what is the actual big picture for enterprises here?

I would assume enterprises employ enough people to write yet another query for whatever use case.

- Is the expectation that in the future, we can bring the flexibility of SQL-like languages to people unfamiliar with SQL?

- Perhaps a salesperson unfamiliar with SQL would like to conduct an analysis. Is the volume and variety of such queries so high that it's worthwhile to optimize the turnaround time from an SQL query designed by a data analyst to the salesperson consuming the results?

Perhaps I am underestimating the scale of the problem but would love some insider perspective here.


I used to get slammed with so many requests that my boss had to tell the sales team to reduce the number of questions and only ask highest priority ones. Analytics teams serve a lot of different teams in an org, and the requests can really pile up. I was basically a bottleneck, which was a lose-lose for me since I was slammed with work and for business stakeholders too since they had to either wait a long time for responses or were limited in what they could even ask.


I see. Following up on this, for the sake of being explicit: was the bottleneck here getting all the data sources in place (perhaps for instance access permissions, legal, etc.), writing the SQL query, both, or something else?


The bottleneck was mostly in writing the SQL query, which took a lot of time due to the messiness/complexity of the data.


Can't get this to work. The instructions are very unclear. I was unable to open a Snowflake connection. Uploaded the schema in a CSV file, but there's no indication of what needs to be done next. I assume "manage context queries" is where it pulls info from. Added a query and provided a description. Tried Q&A; nothing happened.


If you are looking for a text-to-SQL solution, I can in all modesty recommend my own https://www.sqlai.ai. Schemas can be added in any format and are automatically parsed/optimized by AI for optimal performance.


How does it handle large schemas? A quick schema dump shows that we have around 100K columns between all of our tables. Say we use 10% of that.

How easy is it to select which tables should be taken into account?

Is there an intermediate context layer step to work with this?

Do you need to provide working examples for the tool to know how joins and relations are usually handled?


> How does it handle large schemas?

For now we don't have special handling for huge schemas (over 10-20k tokens). I would suggest adding only the tables you need. But hopefully this week I will test a solution that automatically pre-determines which tables to include for huge schemas before the actual SQL query generation, thereby using relatively few tokens.
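To make the pre-selection idea concrete, here's a toy sketch of that step. A plain bag-of-words cosine similarity stands in for a real embedding model so the example runs standalone; the schema and question below are made up for illustration.

```python
# Sketch: rank tables by similarity to the question, keep only the top-k,
# and send just those to the SQL-generation step.
from collections import Counter
import math

def vectorize(text):
    # Crude token counts; an embedding model would replace this.
    return Counter(text.lower().replace("_", " ").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_tables(question, schema, k=2):
    """Score each table by similarity between the question and its
    'table name + column names' text, return the k best matches."""
    q = vectorize(question)
    scored = [(cosine(q, vectorize(f"{t} {' '.join(cols)}")), t)
              for t, cols in schema.items()]
    return [t for _, t in sorted(scored, reverse=True)[:k]]

schema = {
    "orders": ["order_id", "customer_id", "order_date", "total"],
    "customers": ["customer_id", "name", "signup_date"],
    "page_views": ["view_id", "url", "viewed_at"],
}
print(top_tables("total orders per customer", schema))
```

Only the selected tables' DDL then goes into the prompt, which keeps the token count roughly constant regardless of how large the full schema is.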

> How easy is it to select which tables should be taken into account?

You can edit the schema using the code editor (it will be in CSV). That is relatively easy. I previously had a solution where you could manually tick off tables to include, but found it a bit cumbersome. Might add that back though. Ideally I want everything to run without the user having to include/exclude tables.

> Is there an intermediate context layer step to work with this?

I will test adding this. It must also be performant in terms of speed and reliability.

> Do you need to provide working examples for the tool to know how joins and relations are usually handled?

Normally the AI can infer that from the schema. If not, you can "teach" it using RAG: https://www.sqlai.ai/posts/enhancing-ai-accuracy-for-sql-gen...
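As a rough illustration of the RAG approach, here's a minimal sketch assuming a store of past question -> SQL pairs: the most relevant pair is retrieved by token overlap (a real system would use embeddings) and prepended to the prompt as a few-shot example. All questions and queries below are invented.

```python
# Tiny "teach by example" retrieval sketch.
EXAMPLES = [
    ("monthly revenue by region",
     "SELECT region, date_trunc('month', d) AS m, sum(rev) FROM sales GROUP BY 1, 2"),
    ("active users last week",
     "SELECT count(DISTINCT user_id) FROM events WHERE ts > now() - interval '7 days'"),
]

def retrieve(question, examples=EXAMPLES):
    # Pick the stored example whose question shares the most tokens.
    q = set(question.lower().split())
    return max(examples, key=lambda ex: len(q & set(ex[0].split())))

def build_prompt(question):
    # Prepend the retrieved pair so the model sees how joins/metrics
    # were handled before, then ask the new question.
    ex_q, ex_sql = retrieve(question)
    return (f"-- Example\n-- Q: {ex_q}\n{ex_sql}\n\n"
            f"-- Now answer\n-- Q: {question}\n")

print(build_prompt("revenue by region for 2023"))
```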


We are working with enterprise-scale schemas, and beyond SQL engines like Databricks we're doing Splunk, Elastic, graph DBs, etc., where many tables/columns are common. Part of the trick includes ideas like using embedding models and continuous learning for schema subsetting & tuning. We are deploying in governments, banks, insurance companies, etc. -- happy to chat if this sounds relevant: louie.ai. Notebooks, dashboarding, headless APIs, pen testing, audit logs, RBAC modes -- a lot is needed for enterprise settings!


If you open a .sql file in the workspace, the queries in that file will be auto-parsed and used as context for Q&A. If you're willing, we'd love to help debug -- could you email support@dalmatian.ai?


Have you run this against UNITE? I'm curious to see how it benchmarks against other text2sql tools:

https://github.com/awslabs/unified-text2sql-benchmark


We have not come across any benchmark dataset that's actually worth evaluating on because the questions are not representative of real world enterprise problems. They don't reflect the degree of context needed to answer domain/business-specific questions accurately.


Can you give me an example of the sort of thing you're talking about? I've been using Defog's sql-eval a little bit, but I'd be interested in knowing more about its shortcomings when evaluating these systems.

https://github.com/defog-ai/sql-eval


An example question in that eval set is "How many publications were published between 2019 and 2021?". That's something GPT without any context can figure out how to answer from a schema (which I assume has a column called publications). An example question that I'd get in my previous role at an ecommerce fraud detection company could be something like "what's the chargeback rate on the ATO segment". Neither chargeback rate nor ATO segment is defined in the database schema. Not only did they have different definitions depending on the context (e.g. which customer), the definitions also changed over time within the same context.
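For concreteness, here's a hypothetical sketch of the kind of context layer this implies: a business glossary that resolves terms like "chargeback rate" to context-specific SQL fragments before the model ever sees the question. Every name and definition below is made up for illustration.

```python
# Toy business glossary: the same term maps to different SQL depending
# on context (e.g. which customer), which schema-only eval sets never test.
GLOSSARY = {
    ("chargeback rate", "customer_a"):
        "sum(case when status = 'chargeback' then 1 else 0 end) * 1.0 / count(*)",
    ("chargeback rate", "customer_b"):
        "sum(chargeback_amount) / sum(order_amount)",
    ("ato segment", "customer_a"):
        "risk_label = 'account_takeover'",
}

def resolve(term, context):
    """Look up a business term's SQL definition for a given context."""
    try:
        return GLOSSARY[(term.lower(), context)]
    except KeyError:
        raise KeyError(f"no definition for {term!r} in context {context!r}")

# The same term resolves differently per context:
print(resolve("chargeback rate", "customer_a"))
print(resolve("chargeback rate", "customer_b"))
```

A versioned glossary (definitions keyed by date range as well as context) would be the natural extension for definitions that drift over time.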


Why make it a full VSCode download instead of a plugin?


There are other product additions in the works, like hooking it up to your locally opened Slack. A plugin would be limiting.


Is this open-source?


It's not, and the "Talk to Sales" and the ToS (https://www.dalmatian.ai/terms) strongly suggest they're targeting enterprise customers with bespoke enterprise-y pricing.


Right, we are not open source, but the IDE is free to use. The Slack integration and other product offerings in the works will be in the 'premium' version of the product that's sold to enterprises.


Recommendation: your HN post shouldn't tell us more about the company and product than your website does.


Thanks we will rethink how the product is presented on our website!



