Almost no database systems support multidimensional arrays. So they are not appr...

aldanor · on Feb 4, 2021

I think that implementing good ndim=2 support would already be a huge leap forward, it doesn't have to be something super generic. E.g., given that most of the classic machine learning is essentially using 2-dimensional data (samples x features) as inputs, this is a very common use case.

E.g., as of right now, having to concatenate hundreds of columns manually just in order to pass them to some ml library in a contiguous format is always a pain and often doubles the max ram requirement.

lmeyerov · on Feb 4, 2021

This may help you do zero copy for a column of multi-dim without losing value types, just that it's encoding a multi-dim. This example is for values that are 3x3 of int8's:

```

import pyarrow as pa

my_col_of_3x3s = pa.struct([ (f'f_{x}_{y}', pa.int8()) for x in range(3) for y in range(3) ])

```

If using ndarrays, I think our helpers are another ~4 lines each. Interop with C is even easier, just cast. You can now pass this data through any Arrow-compatible compute stack / DB and not lose the value types. We do this for streaming into webgl's packed formats, for example.

What you don't get is a hint to the downstream systems that it is multidimensional. Tableau would just let you do individual bar charts, not say a heatmap, assuming they support rank 2's. To convert, you'd need to do that zero-copy cast to whatever they do support. I agree a targetable standard would avoid the need for that manual conversion, and increases the likelihood they use the same data rep.

Native support would also avoid some header bloat from using structs. However, we find that's fine in practice, it's metadata. E.g., our streaming code reads the schema at the beginning and then passes it along, so actual payloads are pure data, and skip resending metadata.

ahachete · on Feb 4, 2021

Postgres: https://www.postgresql.org/docs/current/arrays.html

mkl · on Feb 4, 2021

If you out a blank line between your bullet points, they'll display properly:

* BigQuery: no

* Redshift: no

* Spark SQL: no

* Snowflake: no

* Clickhouse: no

* Dremio: no

* Impala: no

* Presto: no

jhgb · on Feb 4, 2021

I suspect that AllegroCache accepts arrays with rank>=2, although I never got around to trying it out. (At the very least its documentation has nothing to say about any limitations on what kinds of arrays can be stored, so I'm assuming it stores all of them.)

est · on Feb 4, 2021

On a side note, Clickhouse had some Arrow support

https://github.com/ClickHouse/ClickHouse/issues/12284

zX41ZdbW · on Feb 4, 2021

ClickHouse has support for multidimensional arrays.