
Wow, I have probably seen 10 of these kinds of companies over the past few months. Personally, I believe (and hope) the winners in this space are going to be modular open-source companies/products as opposed to the "all-in-one enterprise solutions".


CEO of Tecton here, and happy to give more context. Tecton is specifically focused on solving a few key data problems to make it easier to deploy and manage ML in production. e.g.:

- How can I deliver these features to my model in production?

- How do I make sure the data I'm serving to my model is similar to what it was trained on?

- How can I construct my training data with point in time accuracy for every example?

- How can I reuse features that another DS on my team built?

We've found that there's a ton of complexity getting data right for real-time production use cases. These problems can be solved, but require a lot of care and are hard to get right. We're building production-ready feature infrastructure and managed workflows that "just work" for teams that can’t or don’t want to dedicate large engineering teams to these problems.
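To make the point-in-time question concrete: this is not Tecton's actual API, just a minimal pandas sketch with hypothetical column names showing how training rows can be built so that each label only sees feature values that existed at label time:

    import pandas as pd

    # Hypothetical data: label events and a history of feature values per user.
    labels = pd.DataFrame({
        "user_id": [1, 1, 2],
        "label_ts": pd.to_datetime(["2020-05-01", "2020-05-10", "2020-05-03"]),
        "label": [0, 1, 0],
    })
    features = pd.DataFrame({
        "user_id": [1, 1, 2],
        "feature_ts": pd.to_datetime(["2020-04-28", "2020-05-05", "2020-05-01"]),
        "avg_txn_7d": [12.3, 15.9, 4.2],
    })

    # merge_asof picks, for each label, the latest feature value with
    # feature_ts <= label_ts -- i.e. no leakage from the future.
    train = pd.merge_asof(
        labels.sort_values("label_ts"),
        features.sort_values("feature_ts"),
        left_on="label_ts",
        right_on="feature_ts",
        by="user_id",
        direction="backward",
    )
    print(train)

Doing this correctly and efficiently across many features, entities, and data sources is where most of the complexity lives.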

At the core of Tecton is a managed feature store, feature pipeline automation, and a feature server. We’re building the platform to integrate with existing tools in the ML ecosystem.

We’re going to share more about the platform in the next few months. Happy to answer any questions. I’d also love to hear what challenges folks on this thread have encountered when putting ML into production.


All of the open positions listed on your careers page appear to be broken. There is no field to upload or attach a CV when applying to any of the roles. Also why would a LinkedIn Profile be mandatory in order to apply for a role? There are many qualified people who have simply chosen not to be a part of that social network.


Ah. We're on it. LinkedIn shouldn't be required. Thanks for flagging.


Pachyderm is probably one of the companies you've seen in this space. Full disclosure: I'm the founder, but I feel that we've stayed pretty true to the idea of being a modular open-source tool. We have customers who just use our filesystem, and customers who just use our pipeline system, and of course many more who use both. We've also integrated best in class open-source projects, for example Kubeflow's TFJob is now the standard way of doing Tensorflow training on Pachyderm, and we're working on integrating Seldon as the serving component. We find this architecture a lot more appealing than an all-in-one web interface that you load your data into.


I haven't used you yet, but IMO this is the way it should be done. Once I get around to cleaning up my current custom k8s pipelines I will give you a spin :)


Additionally, all of Google, Amazon, and Microsoft are pushing very heavily in the ML DevOps space. And if you are training/deploying ML models at such a frequency that you need to utilize DevOps, chances are you are already using their platforms for server compute.


Open Source companies are like open source car manufacturers. When the company dies and stops making the car, will the customers start a new car manufacturing business just to support their cars? Or buy a new car?

As AWS shows, proprietary all-in-one [platform] is fine as long as it's a-la-carte.


Could you please mention the other solutions you've seen in this space?


Polyaxon is an open-source machine learning automation platform. It lets you schedule notebooks, TensorBoards, and container workloads for training ML and DL models. It also has native integration with Kubeflow's operators for distributed training.

https://github.com/polyaxon/polyaxon


https://dolthub.com is the cool kid right now. There are also Pachyderm, Git LFS, and IPFS.

Really, what we need is version control for data; it's not just an ML problem. It's a little different from code versioning, though, because you would like to move computation to the data rather than the other way around.


The utility of version controlling production-sized data (as opposed to sample training data, or code) is something I'm having trouble grasping unless I'm missing something here -- and I may be, so please enlighten me.

It seems to me that to be able to time-travel in data you almost need to store the write-ahead log (WAL) of database transactions and be able to replay it. Debezium captures the CDC information, but it's an infrastructure-level tool rather than a version control tool.

In data science, most time-travel issues are worked around using bitemporal data modeling, which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written". Then you can roll things back to any ETL point in a performant fashion. This is particularly useful for debugging recursive algorithms that get retrained every day.

But these are infrastructure-level approaches. I'm not sure that it's a problem for a version control tool.
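For anyone who hasn't seen the timestamp-column workaround, it is tiny in code -- a minimal sketch, assuming plain pandas and a hypothetical balances table:

    import pandas as pd

    # Hypothetical "customer balance" table where every ETL run appends rows
    # instead of updating in place, recording when each row was written.
    balances = pd.DataFrame({
        "customer_id": [7, 7, 7],
        "balance":     [100.0, 250.0, 180.0],
        "written_at":  pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-03"]),
    })

    def as_of(df, ts):
        """Reconstruct the table as it looked at ETL time `ts`:
        keep the latest row per customer written at or before ts."""
        snap = df[df["written_at"] <= ts]
        return snap.sort_values("written_at").groupby("customer_id").tail(1)

    # "Roll back" to how the data looked after the May 2nd load.
    print(as_of(balances, pd.Timestamp("2020-05-02")))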


Tim, CEO of Liquidata, the company that built Dolt and DoltHub, here. This is how we store the version-controlled rows so that we get structural sharing across versions (i.e. 50M rows + one row change becomes 50M+1 entries in the database, not 100M, with no need to replay logs):

https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl...


Thanks, that looks like an interesting approach. I may have missed this in the article, but let's say I have a SQL database with 600m records, and an ETL process does massive upserts (20m records) every day, with many UPDATEs on 1-2 fields.

Wouldn't discovering what those changes are still entail heavy database queries? Unless Dolt has a hook into most SQL databases' internal data structures? Or WALs?


You have to move your data to Dolt. Dolt is a database. It's got its own storage layer, query engine, and query parser. Diff queries are fast because of the way the storage layer works.

Right now, Dolt can't easily be distributed (i.e. the data must fit on one hard drive), so it's not meant for big data -- more for data that humans interact with, like mapping tables or daily summary tables. But long term, if we can get some traction, we plan on building "big dolt", which would be a distributed version that can scale as big as you want.


Ah now I understand!

So for most analytic workloads, a columnstore DB is typically used due to the need for performance and advanced SQL features (e.g. windowing functions) for complex analytic queries -- which I don't expect Dolt to replace. That means if we wanted to use Dolt's features, we would have to continuously ETL the data into Dolt, which would entail mirroring the entire database (or at least the parts we want to version control).

Dolt essentially becomes a derived database specifically used for versioning. I see how this might work for some use cases.


If you are working within the Apache Spark ecosystem you can use Delta Lake (https://delta.io/) to create 'merge' datasets which are transactional, versioned, and allow time travel by both version number and timestamp.
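For reference, reading an older state of a Delta table looks roughly like this (PySpark sketch; assumes a Delta table already written at the hypothetical path /data/events and a Spark session with the Delta connector on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is available

    path = "/data/events"  # hypothetical existing Delta table

    # Read the table as of an older version number...
    v1 = spark.read.format("delta").option("versionAsOf", 1).load(path)

    # ...or as of a timestamp.
    old = (spark.read.format("delta")
           .option("timestampAsOf", "2020-05-01 00:00:00")
           .load(path))

    v1.show()
    old.show()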


Another alternative to Delta Lake is Apache Hudi, which also includes Bloom filters for indexing time-travel queries (efficiently excluding files given the supplied time constraint). Z-order indexing is not yet available in open-source Delta Lake, only in the Databricks version.
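A Hudi time-travel read looks roughly like the following PySpark sketch (hypothetical table path; the `as.of.instant` option assumes a recent Hudi release and the hudi-spark bundle on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    base_path = "/data/hudi/events"  # hypothetical existing Hudi table

    # Query the table as of a past commit instant; Hudi resolves the
    # file slices that were committed at or before that instant.
    snapshot = (spark.read.format("hudi")
                .option("as.of.instant", "2020-05-01 00:00:00")
                .load(base_path))

    snapshot.show()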


One of the cool things about Dolt is that you can query the diff between two commits. This functionality is available through special system tables. You specify two commits in the WHERE clause, and the query only returns the rows that changed between the commits. The syntax looks like:

`SELECT * FROM dolt_diff_$table WHERE from_commit = '230sadfo98' AND to_commit = 'sadf9807sdf'`
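Since Dolt serves a MySQL-compatible protocol (via `dolt sql-server`), you could also pull that diff into Python with any standard MySQL client -- a rough sketch, with made-up commit hashes and table name:

    import mysql.connector  # pip install mysql-connector-python

    # Assumes `dolt sql-server` is running locally against the repo.
    conn = mysql.connector.connect(
        host="127.0.0.1", port=3306, user="root", database="mydb"
    )
    cur = conn.cursor(dictionary=True)
    cur.execute(
        "SELECT * FROM dolt_diff_prices "
        "WHERE from_commit = %s AND to_commit = %s",
        ("230sadfo98", "sadf9807sdf"),
    )
    for row in cur.fetchall():
        print(row)  # from_/to_ column values plus a diff type per changed row
    conn.close()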


> In data science, most time-travel issues are worked around using bitemporal data modeling: which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written".

Not quite, this is "transaction time". You also need "valid time" to be truly bitemporal. Recovering the database as of some point in time is not enough to answer questions like "when will this fact become false?" or "when did our belief about when it would become false change?", because you didn't preserve assertions about the time range over which the fact was held to be true.
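A tiny illustration of the difference, using plain pandas and hypothetical "price" facts that carry both a valid-time range (as two columns, for simplicity) and a transaction timestamp:

    import pandas as pd

    # Each row asserts: "price is X, valid over [valid_from, valid_to),
    # and we recorded that belief at recorded_at" (transaction time).
    prices = pd.DataFrame({
        "sku":         ["A", "A", "A"],
        "price":       [9.99, 11.99, 10.49],
        "valid_from":  pd.to_datetime(["2020-01-01", "2020-06-01", "2020-06-01"]),
        "valid_to":    pd.to_datetime(["2020-06-01", "2021-01-01", "2021-01-01"]),
        "recorded_at": pd.to_datetime(["2019-12-15", "2020-05-20", "2020-05-28"]),
    })

    def price_as_known(df, valid_at, known_at):
        """What did we believe, as of `known_at`, the price was at `valid_at`?"""
        known = df[df["recorded_at"] <= known_at]                   # transaction time
        valid = known[(known["valid_from"] <= valid_at) &
                      (valid_at < known["valid_to"])]               # valid time
        return valid.sort_values("recorded_at").tail(1)             # latest belief wins

    # On 2020-05-22 we believed the July price would be 11.99;
    # a correction recorded on 2020-05-28 changed that belief to 10.49.
    print(price_as_known(prices, pd.Timestamp("2020-07-01"), pd.Timestamp("2020-05-22")))
    print(price_as_known(prices, pd.Timestamp("2020-07-01"), pd.Timestamp("2020-06-01")))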

In terms of implementations, ranges are better than double timestamps. They provide their own assertion of monotonicity and can be easily used in exclusion indices.

I found that Snodgrass's textbook was a good introduction to the concepts and it's available for free: https://www2.cs.arizona.edu/~rts/tdbbook.pdf


Yes, you're correct -- an omission on my part. You need "valid time" (otherwise it's just "uni"-temporal modeling).

Thank you for the link to Snodgrass' book. I've not seen a formal book on temporal modeling in SQL before, so this is fascinating.


Glad I could help! The research seems to have puttered on for a while after this book was written, but appears to have fizzled out by around the turn of the millennium.

Some notion of bitemporalism showed up in SQL:2011, but it is somewhat constrained compared to what Snodgrass describes.


I worry about retraining every day. Isn't that a flag that says "It hasn't learned a thing and actually I'm just improving my backfitting score"?


Not really -- in many forecasting applications in fast-changing markets, it is fairly common to dynamically retrain your recursive model to a moving window of historical data in order to adapt to your current environment (with some regularization). The length of the window depends on how fast the market changes.

For these types of recursive model applications, you cannot just fit the model once and forget about it.
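A minimal sketch of what that looks like in practice (scikit-learn, made-up data; the point is just the sliding window, not the model):

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))          # 500 days of made-up features
    y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

    WINDOW = 90  # retrain on the trailing 90 days only

    preds = []
    for t in range(WINDOW, len(X)):
        model = Ridge(alpha=1.0)                     # some regularization
        model.fit(X[t - WINDOW:t], y[t - WINDOW:t])  # refit on the moving window
        preds.append(model.predict(X[t:t + 1])[0])   # one-step-ahead forecast

    print(f"made {len(preds)} out-of-sample predictions")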


As long as it works well on out-of-sample data at deployment time, it is okay.

Until some major data drift happens, but you would notice it anyway.


Honestly, I've heard people in Vegas tell me the same about their strategies vs. slots. Genuinely, if you have made money from this - well done, take it out now, congratulate yourself. If you haven't...


Thanks! There are indeed many new players in the data versioning space (DVC and Quilt are also probably worth mentioning).

I totally agree that data management problems are not just ML-related. But I personally think there are additional challenges in the space beyond version control for data -- the whole area of data quality management and monitoring, for example. I like the analogy to DevOps: source version control was a super critical problem to solve in software development, but it didn't stop there, with things like CI/CD etc. I believe we'll see a similar evolution in the data space.


https://logicalclocks.com with their ML + Feature Store open source platform Hopsworks and their managed cloud version https://hopsworks.ai


Disclaimer: I am a co-founder of Logical Clocks. There are loads of interesting technical challenges in this "Feature Store" space. Here are just a few we address in Hopsworks:

1. To replicate models (needed for regulatory reasons), you need to commit both data and code. If you have only a few models, fine, just archive the training data. But if you have lots of models (dev+prod) and lots of data, you can't use Git-based approaches where you commit metadata and make immutable copies of the data -- it scales (your data!) badly. We follow the ACID data lake approach (Apache Hudi), where you store diffs of your data and can issue queries like "give me training data for these features as it was on this date".

2. You want one feature pipeline to compute features (not one for training and a different one when serving features). Your feature store should scale to store TBs/PBs of cached features to generate train/test data, but should also return feature vectors with single-millisecond latency for online apps making predictions. What DB has those characteristics? We say none, so we adopt a dual-DB approach with one DB for low latency and one for scale-out SQL. We use open-source NDB and Hive on our HopsFS filesystem, where both DBs and the filesystem share the same unified, scale-out metadata layer (a "rm -rf feature_group" on the filesystem also automatically cleans up the Hive and feature metadata).

3. You want to be able to catalog/search for features using free-text search and have good exploratory data analysis. The systems challenge here is how to allow search on your production DB with your features. Our solution is that we provide a CDC API to our Feature Store, and automatically sync extended metadata to Elastic with an eventually consistent replication protocol. So when you 'rm -rf ..' on your filesystem, even the extended metadata in Elastic is automatically cleaned up.

4. You need to support reuse of features in different training datasets. Otherwise, what's the point? We do that using Spark as a compute engine to join features from tables containing normalized features (a rough sketch of such a join is below).
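Not the Hopsworks API, just a generic PySpark sketch with hypothetical feature-group tables keyed on customer_id:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical feature groups, each maintained by its own pipeline.
    txn_features = spark.table("fs.txn_features_v1")          # customer_id, avg_txn_7d, ...
    profile_features = spark.table("fs.profile_features_v2")  # customer_id, age, segment, ...
    labels = spark.table("fs.churn_labels")                   # customer_id, churned

    # Join the reused feature groups onto the labels to build training data.
    training_df = (labels
                   .join(txn_features, on="customer_id", how="left")
                   .join(profile_features, on="customer_id", how="left"))

    training_df.write.mode("overwrite").parquet("/data/training/churn_v1")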

References:

* https://www.logicalclocks.com/blog/mlops-with-a-feature-stor...

* https://ieeexplore.ieee.org/document/8752956 (CDC HopsFS to Elastic)

* http://kth.diva-portal.org/smash/get/diva2:1149002/FULLTEXT0... (Hive on HopsFS)


Here's a list of companies/tools in the Git for Data space:

https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-...


I'm actually building a "modular open-source company/product" in the MLOps space:

BentoML https://docs.bentoml.org/en/latest/



Composable https://composable.ai is another tool in this space


Completely spot on. Too many "all-in-one" platforms are just too rigid, and with AI infrastructure tooling still in its early stages, the companies that adopt modular products will be able to capitalize on new advances.


Yeah, we're releasing our platform as open source soon too... kinda feel bad for these guys but it'll be tough to compete with platforms that have a larger open source following and plenty of end-users.


I wonder what the business model is for teams/startups offering open-source solutions that they developed in-house.



