More

datascientist · 2025-01-08T02:08:35 1736302115

https://gradientflow.com/the-moderation-dilemma-a-balanced-l...

datascientist · 2024-11-05T01:29:00 1730770140

How does this relate to https://github.com/lancedb/lance

dmpetrov · 2024-11-05T02:21:26 1730773286

Lance is just a data format. Lance DB might be more comparable to DataChain.

DataChain focuses on data transformation and versioning, whereas LanceDB appears to be more about retrieving and serving data. Both designed for multimodal use cases.

From technical side: Lance has it's own data format and DB engine while DataChain utilizes existing DB engines (SQLite in open-source and ClickHouse/BigQuery in SaaS).

In SaaS, DataChain has analytics features including data lineage tracking and visualization for PDFs, videos, and annotated images (e.g., bounding boxes, poses). I'm curious to understand the unique value of LanceDB's SaaS — insight would be helpful!

You could think of it as OLTP (Lance) versus OLAP (DataChain) for multimodal data, though this analogy may not be perfect.

m0sth8 · 2024-11-05T02:38:04 1730774284

How about daft https://github.com/Eventual-Inc/Daft - also looks like a new multimodal dataframe framework

jaychia · 2024-11-05T23:03:23 1730847803

One of the maintainers of Daft here.

Just dug through the datachain codebase to understand a little more. I think while both projects have a Dataframe interface, they're very different projects!

Datachain seems to operate more on the orchestration layer, running Python libraries such as PIL and requests (for making API calls) and relying on an external database engine (SQLite or BigQuery/Clickhouse) for the actual compute.

Daft is an actual data engine. Essentially, it's "multimodal BigQuery/Clickhouse". We've built out a lot of our own data system functionality such as custom Rust-defined multimodal data structures, kernels to work on multimodal types, a query optimizer, distributed joins etc.

In non-technical terms, I think this means that Datachain really is more of a "DBT" which orchestrates compute over an existing engine, whereas Daft is the actual compute/data engine that runs the workload. A project such as Datachain could actually run on top of Daft, which can handle the compute and I/O operations necessary to execute the requested workload.

dmpetrov · 2024-11-05T03:24:58 1730777098

Good question! I’m not so familiar with it.

It looks like Daft is closer to Lance with it’s own data format and engine. But I’d appreciate more insights from users or the creators.

datascientist · 2024-10-09T19:29:39 1728502179

also see https://gradientflow.com/open-source-principles-in-foundatio...

datascientist · on May 15, 2023

Alignment is hard: "If the LLM has finite probability of exhibiting negative behavior, there exists a prompt for which the LLM will exhibit negative behavior with probability 1." Source: Fundamental Limitations of Alignment in LLMs https://arxiv.org/abs/2304.11082

datascientist · on Dec 13, 2022

Creators of the data quality tool for computer vision, fastdup, continue to improve on their free release https://github.com/visual-layer/fastdup

Here's a short video of some recent results for LAION 400M https://www.youtube.com/watch?v=dlRCm29Upu4

datascientist · on Aug 2, 2022

Recent perspectives from the creators of Prefect, Dagster, Flyte, and Orchest => https://gradientflow.com/summer-of-orchestation/

claytonjy · on Aug 2, 2022

Seems to be missing Temporal/Cadence, which I'm very excited about, but I've never heard of Flyte or Orchest.

datascientist · on Sept 16, 2019

If you need to scale out or speedup pandas, there's Modin https://modin.readthedocs.io/en/latest/ (which uses Ray from)

datascientist · on Feb 12, 2019

Gunnar Carlsson will be teaching a related tutorial ("Using topological data analysis to understand, build, and improve neural networks") on April 16th in New York City https://conferences.oreilly.com/artificial-intelligence/ai-n...

datascientist · on Jan 27, 2018

RISE Lab's Ray platform (now includes RLlib) is another option https://www.oreilly.com/ideas/introducing-rllib-a-composable...

datascientist · on Jan 24, 2017

Chainer's Define-by-run apporach is also described here https://www.oreilly.com/learning/complex-neural-networks-mad...