Array Databases: Concepts, Standards, Implementations [pdf] (rd-alliance.org)
76 points by teleforce 29 days ago | 22 comments



It’s worth noting that the author of this report is Peter Baumann, the author of Rasdaman. So it should be no surprise that Rasdaman comes out on top in the various benchmarks and is presented as the leading “array database.”

My two cents (as the author of Xarray, one of the Python libraries mentioned in this report) is that it’s questionable whether we need “array databases” at all. Certainly we need to be able to store arrays and compute with them, but do we need an integrated solution that does both at the same time with a query language that looks like SQL? Maybe not, in an era of cloud computing, prolific open source software and when everyone who works with big array datasets already knows Python.


You cannot automatically conclude that rasdaman comes out on top because the author is also involved in its development, although it may look suspicious. I am also one of the authors and contributed to the benchmarks: our goal was to configure and implement the queries in the best way for each system and achieve comparable results. Note that this report was done almost three years ago, so the results may be getting out of date.

This query language that looks like SQL is an official part of SQL now [1]. Surely there is a place for integrated DB solutions that let you work with both relational and array data in one place? There are more benefits to this than just performance/scalability. Think of building services on top of big array datasets, beyond one-off data science experiments.

1. https://www.iso.org/standard/67382.html


I agree that array processing probably doesn't need to live in a database, but I think databases should base their foundations on arrays.

In kdb you go from primitives to arrays to tables. In SQL you go from primitives straight to tables, which makes it cumbersome to do any simple one-column or array ops, such as excluding a column from a select expression.

Compare first-class array support in SQL vs. hypothetical programmable SQL:

    select table.* - {name, age}
    from table

vs

    select (
        select column
        from table
        where column
          not in ('name', 'age')
    )
    from table
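
For what it's worth, SQLite has no built-in way to say "all columns except these", so in practice the column list tends to get built in the host language. A minimal sketch in Python, assuming a made-up `people` table and a hypothetical helper `select_except` (neither comes from the snippets above). Identifiers are interpolated unescaped, so this is only reasonable for trusted names:

```python
import sqlite3


def select_except(conn, table, excluded):
    """Return (kept column names, rows) for `table` without the `excluded` columns.

    Hypothetical helper for illustration; `table` is interpolated as-is.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    kept = [name for name in columns if name not in excluded]
    if not kept:
        raise ValueError("no columns left to select")
    column_list = ", ".join(f'"{name}"' for name in kept)
    rows = conn.execute(f"SELECT {column_list} FROM {table}").fetchall()
    return kept, rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, age INTEGER, city TEXT)")
conn.execute("INSERT INTO people VALUES (1, 'Ada', 36, 'London')")

kept, rows = select_except(conn, "people", {"name", "age"})
print(kept, rows)
```

The column set is ordinary data here (a Python list), which is roughly what the `table.* - {name, age}` form would give you inside the query language itself.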


> In SQL you go from primitives straight to tables

This makes a lot of sense, because primitives ARE "tables" and columns ARE "tables".

A primitive is a relation of one column and one row. This is what allows you to do:

    SELECT * FROM (SELECT 1)a

What SQL/RDBMSs do not do well is exploit this.
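
A quick way to check the "scalar is a one-row relation" point is to run it. This is a small sketch using Python's built-in `sqlite3`; the table `t` and the alias `one` are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A bare scalar is a one-row, one-column relation, so it can sit in FROM.
scalar_rows = conn.execute("SELECT * FROM (SELECT 1) a").fetchall()

# The same scalar relation can be joined like any other table.
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(10,), (20,)])
joined = conn.execute(
    "SELECT t.x + a.one FROM t, (SELECT 1 AS one) a ORDER BY t.x"
).fetchall()

print(scalar_rows, joined)
```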


Xarray is absolutely sublime by the way, thank you for your work there. I stuffed around with multi-indices in pandas for a good while before finding Xarray and instantly having all my problems solved :)


If you can't compute where the data is, you will end up having to pull back all the data to calculate against it. Assuming the cost of transfer for the full data set is >50% compared to performing at least some calculations that reduce the size, it's worth it.
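
A toy illustration of the difference, assuming a made-up `readings` table in an in-memory SQLite database. Nothing crosses a network here, so the row counts only stand in for transfer volume:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(i % 4, float(i)) for i in range(1000)],
)

# Option A: pull every row to the client, then reduce there.
all_rows = conn.execute("SELECT sensor, value FROM readings").fetchall()
sums, counts = {}, {}
for sensor, value in all_rows:
    sums[sensor] = sums.get(sensor, 0.0) + value
    counts[sensor] = counts.get(sensor, 0) + 1
client_means = {sensor: sums[sensor] / counts[sensor] for sensor in sums}

# Option B: reduce where the data lives and ship only the result.
reduced_rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"
).fetchall()
server_means = dict(reduced_rows)

rows_shipped_a = len(all_rows)      # every reading
rows_shipped_b = len(reduced_rows)  # one row per sensor
print(rows_shipped_a, rows_shipped_b)
```

Both options give the same per-sensor means; what changes is how many rows have to leave the store to get them.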


This is Stonebraker's argument for shared-nothing architecture and it applies well for interactive ad-hoc analytics on well structured data.

Many orgs these days store all data in data lake shared-disk architectures and pull down the subsets. The performance hit of pulling down data over a high-bandwidth channel such as S3 to EC2 is much more acceptable to companies than storing everything on expensive compute instances just so that the "data would be there" ready for querying if somebody ever needs it.


I'm curious how Google is making money with array technologies. Is it mostly marketing to get geodata people to use GCP, or is there some other product?


The short answer is that I wrote Xarray before I showed up at Google :)


Interesting discussion evolving! Disclaimer: I am co-author of that paper as well.

That said: you may call it "suspicious" if the authors come to that conclusion, but on the same grounds it is likewise suspicious if the writer of a tool that has not excelled doubts the results :)

Let us rather look at facts - the benchmark is published and open, and actually similar figures have been reported by other, completely independent benchmarks. The paper has undergone tough scrutiny by 5 independent experts in the field before publication.

Doubting the value of databases reminds me of the old times of COBOL vs SQL: "we don't need SQL data management". Incidentally, the IT world has since embraced databases of all kinds... and arrays, of all things, should be the big exception? That does not make sense. Tools will need to accept that there are other tools as well, and ideally discussion is merit-based.

"everyone who works with big array datasets already knows Python"...well, if the only tool you know is a hammer then...ya know. There are so many more worlds than just your comfort zone, python! Just think of R, for example.

PS: I am not questioning xarray - in projects we have used xarray as a frontend to rasdaman, and this combination works like a charm: python wrapper around scalability and federation, connected through the open OGC standards.


Hi Peter, good to run into you again :)

I agree, formulating open benchmarks is great work, and there's nothing suspicious about performing well on benchmarks you write. It's just worth keeping in mind: https://matthewrocklin.com/blog/work/2017/03/09/biased-bench...

My initial remarks were a little careless and overly provocative! I don't doubt that there are cases where a true "array database" provides value. I do think the use-cases are less clear for arrays than they are for tabular data, because the users of arrays tend to be more sophisticated.


By the way, is there another reference with a full description of how the benchmarks are set up? I'd be curious how Xarray (end user API) + Dask (distributed compute) + Zarr (distributed storage) compares.

Xarray definitely takes a different philosophical approach based on its roots in the Python data science ecosystem, compared to "all in one" solutions like full array databases.


another PS: the article is not only about benchmarking, but also about a detailed, deep functionality comparison of 19 array tools.


Come on. Two typos in the first few pages; p.2 "servicees", p.6 "lanuage". I know you are smarter than me, but if seven people decided that an editor would be useless, or spellcheck was unnecessary, then you should be prepared for unwarranted, biased criticisms of your conclusions. You're better than this.


I'm not convinced that "statistics" or "OLAP" is a real use case for array dbs (p. 5). Do you have any examples?


Oh, sure - both model multi-dimensional situations, just not necessarily with spatio-temporal semantics (see the famous data warehouse example of a sales cube: time x products x subsidiaries). Operations on an abstract level are relatively similar - a rollup from days to weeks is pretty similar to scaling an image by 7. Then, there are various differences in detail, depending on the domain.
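
To make the rollup/scaling analogy concrete, here is a rough sketch in plain Python. The helper names `rollup` and `downscale` and the sample data are made up for illustration, and choosing sum for the rollup versus mean for the image is an assumption about typical usage, not something from the paper:

```python
def rollup(values, factor):
    """Sum consecutive groups of `factor` values (e.g. days -> weeks)."""
    if len(values) % factor:
        raise ValueError("length must be a multiple of factor")
    return [sum(values[i:i + factor]) for i in range(0, len(values), factor)]


def downscale(image, factor):
    """Average non-overlapping factor x factor blocks of a 2-D list."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("both dimensions must be multiples of factor")
    result = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [
                image[r + dr][c + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


daily_sales = list(range(1, 15))        # 14 days of made-up sales
weekly_sales = rollup(daily_sales, 7)   # 1-D aggregation with stride 7

image = [[float(r * 14 + c) for c in range(14)] for r in range(14)]
small = downscale(image, 7)             # same idea along two axes

print(weekly_sales, small)
```

Both functions apply an aggregate over fixed-size tiles of an array; only the number of dimensions and the choice of aggregate differ.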


Instead of an array database, I would like to have an array extension for postgres.


PostgreSQL can just implement the SQL/MDA (Multi-Dimensional Arrays) standard; they are famous for implementing virtually all of SQL, so this might come in the future. PostGIS Raster is an on-top attempt which is benchmarked in the paper as well, BTW.


Why is the chart on page 8 so blurry?


Hm, on p. 8 I do not see a graphic, can you give the number please? If it is Fig 4: I had quite some interaction with the final editing team as they wanted to have every graphic at full page width, regardless of the intended (and submitted) size. In the end they succeeded. Fig 5 is a "nice" example. Also the table formatting in the Annex was nicely structured through colors, all gone.


Fig 2 is on Page 8... and it's been stretched to 100% width which has resulted in anti-aliased text on the diagram.


ah, I see now - the reference is to the RDA article, not to the recent update published in the Springer Big Data Journal: https://journalofbigdata.springeropen.com/articles/10.1186/s...

my bad, sorry!



