
I like the format, but these seem grammatically incorrect written down in Arabic, e.g. missing articles, etc. I guess you're going for street Arabic, but you should have it in fusha; it's easier to go the other way around.


Ah yeah, I should specify this is in the Egyptian dialect.


An acqui-hire one year after a $32M round, oof.


"Thank god for that liquidation preference!" said the VCs.


I say this as a personal Rust fan, but practically, if a language that combined Kotlin and Go existed, it would be an awesome standard business language. Kotlin is good but very tied to the JVM, and Go has a hostile syntax and developer experience. Another way to frame it is Rust with a garbage collector. Most business APIs don't really need Rust's zero-cost abstractions.


Is being tied to the JVM that much of a problem these days? Developer workstations and servers have much more compute and RAM now, and native compilation with GraalVM seems to eliminate high JVM startup times and memory overhead. Quarkus, for example, is known for a snappy developer experience.


If you're interested, take a look at Crystal (https://crystal-lang.org/)!


I see OCaml described as "Rust with a GC" pretty often, but maybe it's too heavily functional for what you're seeking.


This is Swift. It's just not used much outside of iOS/macOS dev.


You know, there's this great abstraction over files we came up with in the data world called 'tables'...


If you want something that looks like a table, while still benefiting from not having to download 8GB of Parquet in order to run the query, you can get one using a CTE:

    with midjourney_messages as (
        select
            *
        from read_parquet(
            list_transform(
                generate_series(0, 2),
                n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                    format('{:06d}', n) || '.parquet'
            )
        )
    )
    select sum(size) as size from midjourney_messages;
Or you can create a view:

    create view midjourney_messages as
    select * from read_parquet(
        list_transform(
            generate_series(0, 2),
            n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                format('{:06d}', n) || '.parquet'
        )
    );
    select sum(size) as size from midjourney_messages;


Are there any pros/cons to the raw query, CTE, and view approaches?


Yes.

Using a view in this example, you can't dynamically change which files are being selected (not even with joins or where clauses). What if new files are generated and suddenly there are more or fewer files? In that case you probably wouldn't want to encapsulate your SQL in a view. Most of the time you would probably bind the list of files into your SQL as needed:

   SELECT SUM(size) AS size FROM read_parquet(:files);
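If you're running the query directly in DuckDB rather than through a client library, a prepared statement gives a similar effect. Here's a minimal sketch that binds the number of files rather than the file list itself; the statement name is made up:

    -- Minimal sketch: a prepared statement, parameterizing how many files to read.
    PREPARE get_total_size AS
        SELECT SUM(size) AS size
        FROM read_parquet(
            list_transform(
                generate_series(0, ?),
                n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                    format('{:06d}', n) || '.parquet'
            )
        );

    -- Re-run with a different argument as more files appear.
    EXECUTE get_total_size(2);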
But in this case, a table macro/function might also be an option:

    CREATE MACRO GET_TOTAL_SIZE(num_of_files) AS TABLE (
        SELECT
            SUM(size) AS size
        FROM read_parquet(
            list_transform(
                generate_series(0, num_of_files),
                n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                    format('{:06d}', n) || '.parquet'
            )
        )
    );

    SELECT * FROM GET_TOTAL_SIZE(55);
Not necessarily related to this article, but CTEs are useful for breaking down a complex query into more understandable chunks. Moreover, you can do interesting things within a CTE's temp tables, like recursion, or freezing/materializing a temp table's results so that it only gets evaluated once instead of every time it gets referenced.

http://duckdb.org/docs/sql/query_syntax/with#recursive-ctes http://duckdb.org/docs/sql/query_syntax/with#materialized-ct...
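As a concrete illustration, here's a minimal sketch of a materialized CTE; it reuses one of the files from the example above, and the choice of aggregates is just illustrative:

    -- Minimal sketch of a materialized CTE: the body is evaluated once and
    -- its result reused, even though it is referenced twice below.
    WITH messages AS MATERIALIZED (
        SELECT size
        FROM read_parquet(
            'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet'
        )
    )
    SELECT
        (SELECT COUNT(*) FROM messages) AS row_count,
        (SELECT SUM(size) FROM messages) AS total_size;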


From my experience with other databases my assumption for DuckDB is:

- Using a raw query, a CTE or a view will have no impact at all on query performance - they'll be optimized down to exactly the same operations (or to a query plan that's similar enough that the differences are negligible)

- CTEs are mainly useful for breaking down more complicated queries - so not great for this example, but really useful the moment you start doing anything more complicated.

- Views are effectively persistent CTEs - they're great if you want to permanently "bookmark" more complex pieces of queries to use later.

I wrote a bit more about CTEs here: https://datasette.io/tutorials/data-analysis#ctes
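To make the "breaking down" point concrete, here's a hypothetical sketch; the orders table and its columns are made up for illustration, not from the article:

    -- Hypothetical sketch: each CTE names one step, so the final query reads top to bottom.
    with recent_orders as (
        select * from orders where created_at >= date '2023-01-01'
    ),
    order_totals as (
        select customer_id, sum(amount) as total_spent
        from recent_orders
        group by customer_id
    )
    select customer_id, total_spent
    from order_totals
    order by total_spent desc
    limit 10;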


and guess what those tables use to store data a lot of the time? Parquet.

I assume that when you say "tables" you mean "external tables" since you're in the data world.

If you didn't, then I guess when you say "tables" you mean "tables" with everything including the kitchen sink? Database, compute, etc.? Does the database always have to be running for the table to be accessible? Who is responsible for the database hardware, resources, or services?

Of course there are fantastic new databases like Snowflake and BigQuery that separate compute and storage... but do they, really? Separating storage and compute is just incredible for scaling, suspend/resume, etc. But can you query a Snowflake/BigQuery table without also having to use their compute? Is there a way that I can just get a "table" and not be forced into using a specific compute-engine and all the other bells and whistles?

So when you say "table", where and how do I get one? And to maintain the theme of the article, a columnar/OLAP/analytics "table" in particular?

As you probably know, there are several (external table) options, Apache Iceberg probably being the most talked about at the moment. External "tables" are just collections of metadata about your files, or conventions about how to lay your files down. When you query these tables with SQL using Athena, Redshift, Snowflake, DuckDB, etc., each and every one of those query engines is reading Parquet files.

(Snowflake, BigQuery, and others are working on features to both manage and read Iceberg tables, so I kinda lied earlier.)


Yeah, even less relevant for that use case with SnapStart.


SnapStart for Java 17 was only released very recently, so if you are a bit early in the cycle it doesn't help. I don't even want to guess when it will be available for Java 21.


Yes, something like Kotlin + Go would be excellent.


Apache Iceberg is kind of this, but more oriented around large data lake datasets.


This is nice! In Matano, we take a similar approach but with Rust + serverless for pulling SaaS logs (https://github.com/matanolabs/matano/tree/main/lib/rust/log_...) and storing them in a data lake.


Perhaps this is true for business data (though I'm skeptical of the claims), but for security data, for example, this isn't true at all. Collecting cloud, identity, SaaS, and network logs/data can easily exceed hundreds of terabytes. That's a big reason why we're building Matano as a data lake for security.

It seems an odd pitch in general to say, "Hey, my product specifically performs poorly on large datasets."


On the contrary, identifying what your product is explicitly not aiming to do is extremely helpful. "Big" adds a lot of complexity and pain; most people aren't doing that; our product avoids the complexity and pain and is the best choice for most people. Seems like a good, simple pitch, and all it requires is the humility to say that your solution isn't the best for some use cases.


Sounds like you're in the "Big Data One-Percenter" category described at the very bottom of the article.


Thank you, fixed the link!

