Hacker News | datascientist's comments


How does this relate to https://github.com/lancedb/lance?


Lance is just a data format. LanceDB might be more comparable to DataChain.

DataChain focuses on data transformation and versioning, whereas LanceDB appears to be more about retrieving and serving data. Both are designed for multimodal use cases.

From the technical side: Lance has its own data format and DB engine, while DataChain uses existing DB engines (SQLite in open source and ClickHouse/BigQuery in SaaS).

In SaaS, DataChain has analytics features including data lineage tracking and visualization for PDFs, videos, and annotated images (e.g., bounding boxes, poses). I'm curious about the unique value of LanceDB's SaaS; any insight would be helpful!

You could think of it as OLTP (Lance) versus OLAP (DataChain) for multimodal data, though this analogy may not be perfect.


How about Daft (https://github.com/Eventual-Inc/Daft)? It also looks like a new multimodal dataframe framework.


One of the maintainers of Daft here.

Just dug through the DataChain codebase to understand a little more. I think while both projects have a Dataframe interface, they're very different projects!

DataChain seems to operate more on the orchestration layer, running Python libraries such as PIL and requests (for making API calls) and relying on an external database engine (SQLite or BigQuery/ClickHouse) for the actual compute.

Daft is an actual data engine. Essentially, it's "multimodal BigQuery/ClickHouse". We've built out a lot of our own data system functionality such as custom Rust-defined multimodal data structures, kernels to work on multimodal types, a query optimizer, distributed joins, etc.

In non-technical terms, I think this means that DataChain really is more of a "dbt" which orchestrates compute over an existing engine, whereas Daft is the actual compute/data engine that runs the workload. A project such as DataChain could actually run on top of Daft, which can handle the compute and I/O operations necessary to execute the requested workload.
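
To make the "orchestration over an existing engine" idea a bit more concrete, here's a minimal sketch using only Python's standard library. It is not DataChain's or Daft's API: `describe`, `build_index`, and `count_by_ext` are hypothetical names, and the per-file step is a trivial stand-in for something like a PIL call or an API request. The point is only the split: Python runs the per-item function, and SQLite (the existing engine) runs the query.

```python
import sqlite3


def describe(path: str) -> tuple[str, str, int]:
    """Hypothetical per-file Python step; returns (path, extension, path length)."""
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return path, ext, len(path)


def build_index(paths):
    """Orchestration side: run the Python step per file, then load rows into SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT, ext TEXT, path_len INTEGER)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        (describe(p) for p in paths),
    )
    conn.commit()
    return conn


def count_by_ext(conn):
    """Compute side: the grouping and counting are executed by SQLite, not Python."""
    rows = conn.execute(
        "SELECT ext, COUNT(*) FROM files GROUP BY ext ORDER BY ext"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_index(["a/cat.jpg", "a/dog.JPG", "b/report.pdf"])
    print(count_by_ext(conn))  # {'jpg': 2, 'pdf': 1}
```

In this framing, an engine like Daft would take the place of SQLite here, handling the query execution and I/O itself rather than delegating them.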


Good question! I’m not so familiar with it.

It looks like Daft is closer to Lance, with its own data format and engine. But I’d appreciate more insights from users or the creators.



Alignment is hard: "If the LLM has finite probability of exhibiting negative behavior, there exists a prompt for which the LLM will exhibit negative behavior with probability 1." Source: Fundamental Limitations of Alignment in LLMs https://arxiv.org/abs/2304.11082


Creators of the data quality tool for computer vision, fastdup, continue to improve on their free release https://github.com/visual-layer/fastdup

Here's a short video of some recent results for LAION 400M https://www.youtube.com/watch?v=dlRCm29Upu4


Recent perspectives from the creators of Prefect, Dagster, Flyte, and Orchest => https://gradientflow.com/summer-of-orchestation/


Seems to be missing Temporal/Cadence, which I'm very excited about, but I've never heard of Flyte or Orchest.


If you need to scale out or speed up pandas, there's Modin https://modin.readthedocs.io/en/latest/ (which uses Ray)


Gunnar Carlsson will be teaching a related tutorial ("Using topological data analysis to understand, build, and improve neural networks") on April 16th in New York City https://conferences.oreilly.com/artificial-intelligence/ai-n...


RISE Lab's Ray platform (now includes RLlib) is another option https://www.oreilly.com/ideas/introducing-rllib-a-composable...


Chainer's define-by-run approach is also described here https://www.oreilly.com/learning/complex-neural-networks-mad...

