Launch HN: MindsDB (YC W20) – Machine Learning Inside Your Database
176 points by adam_carrigan 17 days ago | 60 comments
Hi HN,

Adam and Jorge here, and today we’re very excited to share MindsDB with you (http://github.com/mindsdb/mindsdb). MindsDB AutoML Server is an open-source platform designed to accelerate machine learning workflows for people with data inside databases by introducing virtual AI tables. We allow you to create and consume machine learning models as regular database tables.

Jorge and I have been friends for many years, having first met at college. We have previously founded and failed at another startup, but we stuck together as a team to start MindsDB. Initially a passion project, MindsDB began as an idea to help those who could not afford to hire a team of data scientists, which at the time was (and still is) very expensive. It has since grown into a thriving open-source community with contributors and users all over the globe.

With the plethora of data available in databases today, predictive modeling can often be a pain, especially if you need to write complex applications for ingesting data, training encoders and embedders, writing sampling algorithms, training models, optimizing, scheduling, versioning, moving models into production environments, maintaining them and then having to explain the predictions and the degree of confidence… we knew there had to be a better way!

We aim to steer you away from constantly reinventing the wheel by abstracting most of the unnecessary complexities around building, training, and deploying machine learning models. MindsDB provides you with two techniques for this: build and train models as simply as you would write an SQL query, and seamlessly “publish” and manage machine learning models as virtual tables inside your databases (we support ClickHouse, MariaDB, MySQL, PostgreSQL, and MSSQL; MongoDB is coming soon). We also support getting data from other sources, such as Snowflake, S3, SQLite, and any Excel, JSON, or CSV file.

When we talk to our community, we find that they are using MindsDB for anything ranging from reducing financial risk in the payments sector to predicting in-app usage statistics - one user is even trying to predict the price of Bitcoin using sentiment analysis (we wish them luck). No matter what the use-case, what we hear most often is that the two most painful parts of the whole process are model generation (R&D) and/or moving the model into production.

For those who already have models (i.e. who have already done the R&D part), we are launching the ability to bring your own models from frameworks like PyTorch, TensorFlow, scikit-learn, Keras, XGBoost, CatBoost, LightGBM, etc. directly into your database. If you’d like to try this experimental feature, you can sign up here: https://mindsdb.com/bring-your-own-ml-models

We currently have a handful of customers who pay us for support. However, we will soon be launching a cloud version of MindsDB for those who do not want to worry about DevOps, scalability, and managing GPU clusters. Nevertheless, MindsDB will always remain free and open-source, because democratizing machine learning is at the core of every decision we make.

We’re making good progress thanks to our open-source community and are also grateful to have the backing of the founders of MySQL & MariaDB. We would love your feedback and invite you to try it out.

We’d also love to hear about your experience, so please share your feedback, thoughts, comments, and ideas below. https://docs.mindsdb.com/ or https://mindsdb.com/

Thanks in advance, Adam & Jorge

Is it inside the database? It looks as though it's actually in a separate server, that is called by the database.

From the user perspective it's inside the database: you can run MindsDB in the background, connect it to the database once, and then do everything from within the database (i.e. connecting with a SQL client to your database server and issuing commands the same way you would query "normal" tables).

From a technical perspective it's a separate server that communicates with the database through various mechanisms (e.g. a federated engine), but it's no different from e.g. multiple instances of MariaDB being abstracted by a Galera cluster into something that behaves like a single database from a client perspective.

Gotcha. I'm not who you were replying to, but I read/skimmed most of your post at first thinking it was a new DBMS specifically for machine learning (especially given the name), which is a lot less interesting to me, and I was about to close the tab. I much prefer the phrasing you used on GitHub; I think it's clearer: "Predictive AI layer for existing databases"

Anyway, I'm seriously going to look into using this on a product for work. Thanks for sharing, and for your work on this! We are in the process of triaging a few different features for a product in education (can chat more about it if you're curious!), including some fairly bog-standard predictive text features. We are also trying to avoid strictly depending on any one proprietary PaaS, so this might just fit the bill!

We are honestly very happy to hear about every use case people have for this right now. At the end of the day, it's users who shape the product to a large extent.

We can chat more about it here or feel free to pick one of the many contact routes scattered around the thread :)

What's the name of the product? What does it do?

Sure, from a client perspective that makes sense, but it misled me a bit. I thought the DBMS was running the ML, a bit like an ML equivalent of a materialised view.

What type of ML algorithms do you support? Do you have benchmarks with performance?

Regarding benchmarks, we have three main dataset collections we focus on currently:

1. Datasets from customers, but obviously those can’t be made public.

2. The OpenML benchmark, which is fairly limited because it’s mainly binary categories, but which is good because it’s a 3rd party, so unbiased. We have some intermediary results here (https://docs.google.com/spreadsheets/d/1oAgzzDyBqgmSNC6g9CFO...); they are middle-of-the-road. However, I think the benchmark is pretty limited, i.e. it doesn’t cover most of the kinds of inputs and almost none of the outputs we support.

3. An internal benchmark suite which currently has 59 datasets, mainly focused around classification and regression tasks with many inputs, timeseries problems and text. Some part of it is public, but opening that up is a bit difficult due to licensing issues. I’m hoping that in the next year it will grow and 90%+ of it can be made public. We benchmark against older versions of MindsDB, against hand-made models we try to adapt to the task, against the state-of-the-art accuracy for the dataset (if we can find it) and against a few other AutoML frameworks (well, 1, but I hope to extend that list) [see this repo for the ones we made public: https://github.com/mindsdb/benchmarks, but I'm afraid it's a bit outdated].

That being said, benchmarking for us is still WIP, since as far as I can tell nobody is trying to build open source models that are as broad as what we're currently doing (for better or worse), and the closed source services offered by various IaaS providers don't really come with public benchmark results outside of marketing.

The benchmarking challenges you are facing are pretty common in the AutoML community. My colleagues and I at Google Research are trying to solve this with https://github.com/google/nitroml. It's still super early days (no CI yet), but I think it could help your team benchmark on a set of open standard benchmark tasks as we open source more of the system.

Looks quite interesting, already pinned this in the relevant slack channel :)

To be honest I'm rather happy with how the internal benchmark suite is turning out, but to some extent you are inviting bias by creating them yourself. On top of that, it doesn't hurt to have more benchmarks.

At the end of the day it's a combination of:

* How much work it is to integrate (easy to measure)

* How visible it is, i.e. whether something interesting we find will be visible and legible to others (iffy to measure; citations, stars, etc. are some indication)

* How useful it is to "improve" the library (hard to measure, and what we aim to be good at is a moving target)

So realistically that's the equation I have to judge in terms of adding a new benchmark suite, and it's very annoying because you'll note the most important things are the hardest to measure.

Would you want people to integrate with this now, or would you rather wait a few weeks/months/years until it matures more? If the former, can you give a few details regarding where to start (the README is fairly barren)? If the latter, please ping me (george.hosu@mindsdb.com) when you think it could be ready to try.

Anyway, any open benchmark library is a step in the right direction, thanks for working on this :)

Thanks for your feedback! Based on the description of how you already do things, I'd say you're ahead of the curve as far as rigorous model quality benchmarking goes. You should absolutely hold off on using nitroml for a few months until it's more mature. It's very much pre-prerelease in a build-in-the-open sense. :) I'll shoot you an email once it's ready for anyone to try out. When the time comes, we'll have a blog post to announce it, and will include proper documentation.

And, congrats on the launch!

The design is modular such that it can support anything under the cover.

Essentially you have encoders for all of the columns, which then get piped into a mixer and then into decoders to predict the final output(s). These encoders and decoders can be any type of ML model, but our current focus is on neural networks.

So e.g. if you have, say, a text like "A cute cat" and the number 5, and your target is an image (let's assume you have a training set such that the model would learn to generate one with 5 cute cats), then you have:

1. A text encoder generates an embedding for "cute cat", and a numerical encoder normalizes "5".

2. A mixer (which can be e.g. an FCNN or gradient booster) generates an intermediate representation.

3. A decoder that is trained to generate images takes that representation and generates an image.

Note: above is a good illustrative example, in practice, we're good with outputting dates, numerical, categories, tags and time-series (i.e. predicting 20 steps ahead). We haven't put much work into image/text/audio/video outputs
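
The encoder, mixer, decoder flow described above can be sketched in a few lines of plain Python. This is only an illustration of the shape of the pipeline, not lightwood's actual API: every function name, the bag-of-words "embedding", the fixed weights and the two labels are assumptions made up for the example, and a categorical output is used instead of an image to keep it small.

```python
import math
from typing import Dict, List, Union

# All names below are hypothetical; this is NOT lightwood's API.

def encode_text(value: str, vocabulary: List[str]) -> List[float]:
    """Bag-of-words counts, a stand-in for a learned text embedding."""
    words = value.lower().split()
    return [float(words.count(term)) for term in vocabulary]

def encode_number(value: float, mean: float, std: float) -> List[float]:
    """Numerical encoder: z-score normalization."""
    return [(value - mean) / std]

def mix(encoded_columns: List[List[float]], weights: List[float], bias: float) -> float:
    """Toy mixer: concatenate every column encoding, then apply one linear layer."""
    features = [x for column in encoded_columns for x in column]
    return sum(w * x for w, x in zip(weights, features)) + bias

def decode_category(score: float, labels: List[str]) -> str:
    """Toy decoder: squash the mixer output and pick one of two labels."""
    probability = 1.0 / (1.0 + math.exp(-score))
    return labels[1] if probability >= 0.5 else labels[0]

def predict(row: Dict[str, Union[str, float]]) -> str:
    # One encoder per input column.
    encoded = [
        encode_text(str(row["description"]), ["cute", "cat"]),
        encode_number(float(row["count"]), mean=3.0, std=2.0),
    ]
    # Weights are fixed here; a real mixer would learn them from training data.
    score = mix(encoded, weights=[0.5, 0.5, 1.0], bias=-1.0)
    return decode_category(score, ["few_cats", "many_cats"])

prediction = predict({"description": "A cute cat", "count": 5})
print(prediction)
```

The point of the split is that each column type gets its own encoder and each target type its own decoder, so supporting a new data type means adding one component rather than a whole new model.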

You should be able to find more details about how we do this in the docs and most of the heavy lifting happens in the lightwood repo, the code for that is fairly readable I hope: https://github.com/mindsdb/lightwood

Also worth mentioning: MindsDB can take input columns of any of the following types (numerical, categorical, text, images), and it's getting pretty good at timeseries problems, for which we support a variety of techniques, including novel approaches to sequential data (RNNs, Transformers, CNN tiling, ...). Given the nature of data in databases, where there is often a chronological order of transactions, we put a lot of focus on offering capabilities that make the models time-aware.

I've been following you, guys, for some months and I must say I'm a huge fan. Being a hardcore Clickhouse user, I got hooked with your tutorial about how to make it work with your product.

Best of luck!

Cheers :)

It's actually quite nice for me to hear that people we didn't hear from yet are finding it useful. Since it's a library it's hard to actually figure out how many people are really using it successfully and what for.

If you don't mind sharing more details please do, either through our usual channels (https://mindsdb.com/contact-us/) or just send me an email (george.hosu@mindsdb.com). Figuring out how people use it and what issues they encountered has been immensely helpful to me.

Thank you!! Let's chat, we would love to show you the cool timeseries stuff we have done for ClickHouse!

Sure thing! I'll contact you using the details in your website!

Is it possible to do anomaly detection with ClickHouse data?

This is actually the next milestone in our time series roadmap, so you can expect it to be possible rather soon :)

We are also looking for this. How can we track this feature's progress? Is there a roadmap or GitHub issue for it?

Here's an issue that enumerates all pending tasks for a first iteration of this feature: https://github.com/mindsdb/mindsdb/issues/1116

This is great. The documentation was clear and highlighted the value proposition immediately. This makes ML a lot more accessible for a developer such as myself.

Go MindsDB!! We've enjoyed working with the MindsDB team at Altinity. The integration with ClickHouse makes clever use of the MySQL protocol to implement models as queryable tables. For anybody interested in the specifics check out the following article: https://altinity.com/blog/machine-learning-models-as-tables.

We will watch your career with great interest.

Thanks, it has been amazing working with you also.

Based on the feedback here it may soon be time to do a follow-up talk on MindsDB at a future ClickHouse meetup. :)

Yes!! We were thinking about this too! We have made some massive progress on timeseries models and lowering latency. Let's chat soon, I'll connect over email!

How does it compare to Apache MADlib? It allows doing ML in the database with SQL interfaces, and has been around for a few years.

As far as I can tell MADlib is not AutoML; it just provides various statistical analysis and "classical" ML algorithms as functions/macros in various databases, and the integration seems to be quite different from the way we do it (and I'd say more complex for the user, but maybe that's just my bias talking).

So I don't think there's a lot of overlap there. But if you think otherwise and work on it or know someone that does, I'd be quite excited to have a chat, just to share experiences and tips if nothing else.

Deep learning, including automated model selection, is under development in MADlib. See for example https://madlib.apache.org/docs/latest/group__grp__keras__run... I guess in the next couple of releases it will probably stabilize and be promoted out of "early development".

I don't work on it or know anyone who does, but it is one of the most established open projects for ML in the database, as far as I know.


So I assume that you are doing hyperparameter search? Can you share what optimization method you are using for search (e.g. random, GP)?

Also, can the search be distributed in parallel across multiple nodes?

And, if MindsDB is not part of the DB, what happens if MindsDB fails?

Also, do you support automatic retraining? If yes, can you elaborate more?

These are amazing questions, Streetcat. We do some hyperparameter search using Optuna; we may be moving to Ray Tune because it can be highly parallelized. If MindsDB fails, it depends on how the various DBs manage federated storage, but essentially you will get a query error. Funny that you mention automatic retraining: people have been asking for this recently, and we will be supporting a retrain_frequency parameter in the coming releases. Would you like to give it a test drive?

I am actually working on a product in the same area (AutoML/MLOps) as a non-YC startup... We might be able to partner. I am not sure how to reach you?

Send us an email - Adam at MindsDB.com and Jorge at MindsDB.com

Absolutely, let's connect!! jorge at mindsdb

> So I assume that you are doing hyperparameter search? Can you share what optimization method you are using for search (e.g. random, GP)?

The short answer is Optuna and Ax, but only sometimes.

The long answer led me down a rabbit hole, and it's 10k+ words and a few experiments deep. If you're interested in this area specifically, ping me; I've got nothing concrete, but I like discussing it. A recent paper I saw that somewhat echoes my thoughts is: https://arxiv.org/pdf/2102.03034.pdf | but some bits feel either over my head and/or overly pedantic and/or overly formal | and I'm not sure I agree with the conclusion | and loads of it is irrelevant. But if the problem interests you I'd suggest giving it some time, with those disclaimers in mind.
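
For readers who have not met the options named in the question, random search is the simplest baseline that tools like Optuna improve on. The sketch below is a generic illustration under stated assumptions, not what MindsDB does: the function names, the toy objective and the parameter ranges are all invented for the example.

```python
import random
from typing import Callable, Dict, Tuple

Params = Dict[str, float]

def random_search(
    objective: Callable[[Params], float],
    search_space: Dict[str, Tuple[float, float]],
    n_trials: int,
    seed: int = 0,
) -> Tuple[Params, float]:
    """Sample each hyperparameter uniformly and keep the lowest-loss trial."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    best_params: Params = {}
    best_score = float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in search_space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params: Params) -> float:
    """Hypothetical validation loss, minimized at learning_rate=0.1, dropout=0.3."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["dropout"] - 0.3) ** 2

best_params, best_score = random_search(
    toy_objective,
    {"learning_rate": (0.0, 1.0), "dropout": (0.0, 0.5)},
    n_trials=200,
)
print(best_params, best_score)
```

Each trial is independent of the others, which is why this kind of search is easy to spread across workers; smarter samplers trade some of that independence for fewer wasted trials.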

> Also, can the search be distributed in parallel across multiple nodes?

Theoretically yes, practically it's still WIP to get this to work, but the architecture we have right now is very much conceived with massive distribution in mind (see our docs for more details on that).

> And, if MindsDB is not part of the DB, what happens if MindsDB fails?

The select query you use to make a prediction returns an error, essentially. Assuming you mean "what happens if it crashes or if the model you are using crashes?".


psql> SELECT diagnostic FROM mindsdb.flu_detector WHERE headache=true AND temperature=37.5 AND cough='mild';

psql> Error: External table returned error: "Segfault"


psql> SELECT diagnostic FROM mindsdb.flu_detector WHERE headache=true AND temperature=37.5 AND coughsfsagsa='mild';

psql> Error: External table returned error: Input column `coughsfsagsa` doesn't exist

(or something like that)
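
Since a failed prediction surfaces as an ordinary query error, an application can handle it the way it handles any other failed query. Below is a minimal sketch of that idea. It uses Python's built-in sqlite3 with a plain table named flu_detector purely as a stand-in; with a real MindsDB setup you would use your own database's driver and its exception class, and the helper name predict_or_none is hypothetical.

```python
import sqlite3
from typing import Optional, Sequence

def predict_or_none(
    connection: sqlite3.Connection, query: str, params: Sequence = ()
) -> Optional[str]:
    """Run a prediction query; return None instead of raising if it fails."""
    try:
        row = connection.execute(query, params).fetchone()
    except sqlite3.Error as exc:
        # A real application might log this and fall back to a default value.
        print(f"prediction query failed: {exc}")
        return None
    return row[0] if row else None

# Stand-in for the model table: an ordinary table in an in-memory database.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE flu_detector (headache INTEGER, temperature REAL, cough TEXT, diagnostic TEXT)"
)
connection.execute("INSERT INTO flu_detector VALUES (1, 37.5, 'mild', 'flu')")

ok = predict_or_none(
    connection,
    "SELECT diagnostic FROM flu_detector WHERE headache = ? AND temperature = ? AND cough = ?",
    (1, 37.5, "mild"),
)
# Misspelled input column, as in the second example above.
bad = predict_or_none(
    connection,
    "SELECT diagnostic FROM flu_detector WHERE coughsfsagsa = ?",
    ("mild",),
)
print(ok, bad)
```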

> Also, do you support automatic retraining?

Not at the moment, but we're going to add it very soon, with the first implementation allowing retraining with a certain user-set frequency (e.g. once every 2 hours).

Which will allow the model to be always fresh as new data comes in (assuming there's no time limit on the query)
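
As a rough illustration of frequency-based retraining (a sketch of the general idea only, since the feature is described here as not yet implemented), a scheduler just needs to compare the time since the last training run against the user-set interval. The function name is_retrain_due and the timestamps are assumptions for the example.

```python
from datetime import datetime, timedelta

def is_retrain_due(last_trained: datetime, now: datetime, frequency: timedelta) -> bool:
    """Return True once at least `frequency` has elapsed since the last training run."""
    return now - last_trained >= frequency

last_trained = datetime(2021, 2, 1, 12, 0)
frequency = timedelta(hours=2)  # e.g. "once every 2 hours"

due_after_one_hour = is_retrain_due(last_trained, datetime(2021, 2, 1, 13, 0), frequency)
due_after_two_hours = is_retrain_due(last_trained, datetime(2021, 2, 1, 14, 0), frequency)
print(due_after_one_hour, due_after_two_hours)
```

A periodic job would call this check, re-run the training query when it returns True, and record the new training time.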

Wow. Thanks for the answer and for the paper! I implemented this one myself in Go: https://arxiv.org/pdf/1810.05934.pdf

The issue with retraining is that you need new labels (assuming supervised ML), so I wonder what process you use to get those.

Huge fan since I saw them at Skydeck Berkeley. Jorge is also one of the most talented engineers and managers I have ever met, and an inspiration to me. Awesome to see Adam & Jorge pushing the limits of ML to make it accessible to everyone!

I got to know about MindsDB when it was a university (Berkeley) project and Amie emailed me to try it out after finding my blog posts. I was thrilled to learn that it was a kind of black-box system where you just feed in data and the system picks a model itself based on the data. I found it interesting because I am not an ML guy, and as a mere developer I wanted something that does some ML magic without going in depth. Glad to know the project has evolved. I will try it again. Congrats!

Looks super interesting! I'm quite literally the ideal user you describe, and I can't wait to try this out. If I can get any meaningful outputs from my databases thanks to MindsDB I'll be super happy.

Also, the hosting is a good idea. I am currently struggling to train my models as I don't have the needed processing power, and instead of building a new PC for ML tasks I would very likely consider hosting the heavy lifting. I'll keep an eye on the release.

If you're interested in participating in a "beta" release of that, we have a newsletter: https://mindsdb.com/newsletter/ where I'm 99% sure it will be announced, and in case you don't get a code ping me and I'll send you one :)

But the timeline for when to release this is still inexact, since ideally I want the "beta" to be as stable as possible, as to not have to get people to migrate to a different version later.

Also, if you have the time to share your usecase in like 3-4 paragraphs please do, either here or email me, because we design this with users in mind at the end of the day.

Sure thing, I'll do both :-)

Congrats on your launch Adam!

What deployment scenarios do you foresee? (given the license is GPL, which is infamous in the business world)

Hi Omneity, thanks for asking. You can deploy MindsDB in a container. No application that you build querying the MindsDB server has licensing dependencies, as it's essentially no different from using most open-source databases out there.

This is pretty cool, I've been following you guys since your Skydeck demo day back in 2018.

Would we be able to use your tool with Tableau?

That's awesome :). Yes, you can use any BI tool, such as Tableau, Power BI, SAS BI, or any other BI tool that can connect to external databases. Tutorials and examples for BI tools should be published soon in our documentation.

Looks very interesting! Any plans to support DynamoDB/Scylla?

I would love to support Scylla, I ** love that database, those guys are magicians. And I assume in supporting that we'd also offer de-facto support for Cassandra.

I don't think either Scylla or DynamoDB is on the roadmap now, but if you want them feel free to create an issue asking for them: https://github.com/mindsdb/mindsdb

It should be noted that there are two levels of support:

1. As a source of data (easy to implement)

2. Being able to publish models into the database (a bit harder)

If you work with those and are interested in doing ML from the database please get in touch, ideally via github, but you can also use the contact form (https://mindsdb.com/contact-us/) or email one of us directly. The best case scenario for us is that when we do one of these integrations we have an actual user in mind, and we're open to "first users" for any database where we can find a reasonable way of integrating.

Thanks for asking! We develop based on community requests. If you want, we can set up a call and see if we can help you with your data on DynamoDB/Scylla; it's becoming pretty fast now to support new databases. Send us an email: adam or jorge at mindsdb.com

If you do have a Scylla use case, please also feel free to let us Scylla monsters know: peter at scylladb dot com. Very encouraged to see such a great convergence occurring between NoSQL + ML. The camps between data scientists and data engineers have been pitched too far apart to date. My best wishes to all those who helped bring this to fruition.

Support for graph databases would be cool.

Did you have one in mind? We add integrations based on community demand.

Support for Tinkerpop would probably make it possible to support all the major ones, e.g. OrientDB and Neo4j.

How does your product differ from MS SQL's integrated R, for someone who only needs MS SQL Server support?

This is a great question. I actually think that if your language is also R, the MSSQL R integration is a great option. What we bring to the table for MSSQL users in particular is more options, as well as better performance for some types of problems, like high cardinality in time series. For example: predicting inventory for all products in a database, taking into account all previous inventory as well as, say, marketing data. Building this with the R bindings would be quite a challenge; with MindsDB it's a simple SQL statement.

Would your offering be slower because the data needs to be transferred outside SQL Server and then back to publish model results?

I haven't looked into it myself, but I'll try to understand what they do better, thanks for letting us know.

I will say that:

1. It's not open source, so hard for us to compare other than running black-box experiments.

2. Oracle, so presumably that comes with all the Oracle-ecosystem buy-ins that implies, which might not be ideal for many people.

As a purely personal opinion:

I guess it's good to know that other people are thinking in the same direction as us, but at the same time I personally would like widely-used ML libraries to be open source. If these models are going to be used as generators of important decision-making algorithms, ideally both the model and the algorithm should be open source. The latter is up to whoever is building the algorithm, but I think if we can get the zeitgeist to move towards the latter being open source as the norm, that can alleviate a lot of potential harm and has little downside.

E.g. do you feel comfortable with the NHS outsourcing important decision making to algorithms that are proprietary black boxes, considering that it's funded by the tax-paying public and is supposed to serve that public?

"Secret" laws used to be a norm in the past, e.g. in large civilizations like the Roman empire, where the norm evolved to be that only "schooled" men could understand the law due to its complexity, or in most of medieval Europe, where the Bible was foundational for morality but closed off to all but the small subset of the population that knew Greek or Latin and could get their hands on it. But in general that seems to have caused more harm than good.

It seems reasonable to ask that, if algorithms are going to be used by governments in decision making, those should be entirely open. Ideally the ones used by corporations should be open to whatever degree is possible, to avoid run-off harm from buggy or unaligned systems.

You are right!! The main thing is to offer it for other databases (MySQL, MariaDB, Postgres, ClickHouse, Timescale, MongoDB), as well as to support more powerful machine learning capabilities than the vanilla classical models supported by Oracle, for instance great timeseries support.

Can I use it in MySQL?

Yes, you can. Check out the MySQL docs https://docs.mindsdb.com/datasources/mysql/

Amazing launch guys! I really like it.
