
I think it's a great idea because, as you mentioned, quality datasets can make or break your model. However, this doesn't address the elephant in the room: no matter how much you curate or clean the data, you are limited to the dataset that you have. The big question is how to get more and better datasets. Tooling is super important, but the real differentiator will be how to collect/generate/capture reliable, defensible datasets going forward. I think your idea is complementary to this other project: https://delegate.dev


I absolutely, 120% agree on the importance of adding the right data. Aquarium helps you answer two questions: "what data should I be collecting to improve my model?" and "where do I find that data?"

For the latter, Aquarium treats the problem of smart data sampling as a search and retrieval problem. You want to find more examples of a "target" from a large stream of unlabeled data. Aquarium does this by comparing embeddings of the unlabeled data to your "target set" and sending examples out for labeling when they fall within a defined distance threshold in embedding space. We don't actually do the labeling, but we wrap around common labeling providers and can integrate with in-house flows via our API.
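In Python, the core idea looks roughly like this (a simplified sketch, not our actual implementation; the function name, cosine distance, and threshold value are just for illustration):

    import numpy as np

    def select_for_labeling(unlabeled_embs, target_embs, threshold=0.3):
        """Return indices of unlabeled examples whose nearest target
        embedding lies within `threshold` (cosine distance)."""
        # Normalize rows so cosine distance reduces to 1 - dot product.
        u = unlabeled_embs / np.linalg.norm(unlabeled_embs, axis=1, keepdims=True)
        t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
        dists = 1.0 - u @ t.T            # shape: (num_unlabeled, num_targets)
        nearest = dists.min(axis=1)      # distance to the closest target example
        return np.nonzero(nearest <= threshold)[0]

At scale you'd want an approximate-nearest-neighbor index rather than a brute-force distance matrix, but the thresholding idea is the same.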


Other founder here! For a high level overview of this framing of the problem, I recommend reading this Waymo blog post [1].

One nice feature is that by using embeddings produced by a user's model, which has been trained in the context of their domain, we can do this sort of smart sampling in domains we've never seen before. Embeddings are also naturally anonymized, so we can do this without access to a user's potentially private raw data streams.

[1] https://blog.waymo.com/2020/02/content-search.html

