
Ask HN: Database for efficiently storing many numpy arrays - malux85
I'm looking for a database that can store dense N-dimensional numpy arrays. Query performance is not as important as storage efficiency; the arrays are dense but mostly 0s and 1s, so they compress well.

I tried pg's array support (very slow) and storing the data in pg's JSON field (also quite slow, but more acceptable).

I should also add that we're going to be writing a lot more than reading, potentially from many processes, so anything with ACID-like support would be great.

But surely there must be dedicated databases out there for dense numerical data like this?
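
To give a rough idea of the kind of data, here's a minimal sketch (the ~95% zero density and array shape are just assumptions for illustration) comparing raw bytes against zlib and numpy's own compressed container:

```python
import io
import zlib
import numpy as np

# Stand-in for the arrays described: dense storage, but mostly 0s (assumed ~95%).
arr = np.random.binomial(1, 0.05, size=(1000, 1000)).astype(np.uint8)

raw = arr.tobytes()
compressed = zlib.compress(raw, level=6)
print(f"raw: {len(raw):,} bytes, zlib: {len(compressed):,} bytes")

# numpy's compressed .npz container gives a similar ratio and keeps dtype/shape.
buf = io.BytesIO()
np.savez_compressed(buf, arr=arr)
print(f"savez_compressed: {buf.getbuffer().nbytes:,} bytes")
```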
======
dodgyb
pyarrow and parquet may fit your requirements. There are a number of analysis
engines (spark, drill, hive, etc) that support parquet. For more detail see:

[https://tech.blue-yonder.com/efficient-dataframe-storage-with-apache-parquet/](https://tech.blue-yonder.com/efficient-dataframe-storage-with-apache-parquet/)

[https://www.slideshare.net/julienledem/strata-ny-2017-parquet-arrow-roadmap](https://www.slideshare.net/julienledem/strata-ny-2017-parquet-arrow-roadmap)
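
As a rough illustration (not taken from the links above), here's a minimal sketch of writing a numpy array to parquet with pyarrow; the column names, zstd codec, and array shape are all made up:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical dense 2-D array of mostly 0s and 1s.
arr = np.random.binomial(1, 0.05, size=(100_000, 32)).astype(np.uint8)

# Parquet is columnar: store each array column as a table column so that
# run-length/dictionary encoding plus zstd can exploit the long runs of 0s.
table = pa.table({f"c{i}": arr[:, i] for i in range(arr.shape[1])})
pq.write_table(table, "arrays.parquet", compression="zstd")

# Round-trip: read the file back and reassemble the original array.
back = pq.read_table("arrays.parquet")
restored = np.column_stack([back.column(name).to_numpy() for name in back.column_names])
assert np.array_equal(arr, restored)
```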

If ACID guarantees are a must, you could use Kafka as a message broker between
memory and file, but this comes at the cost of added complexity. For more info:

[https://eng-staging.verizondigitalmedia.com/2017/04/28/Kafka-to-Hdfs-ParquetSerializer/](https://eng-staging.verizondigitalmedia.com/2017/04/28/Kafka-to-Hdfs-ParquetSerializer/)

[http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/](http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/)
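
If you go the Kafka route, the write path could look roughly like the sketch below; kafka-python, the broker address, and the topic name are all assumptions for illustration, and a separate consumer would batch the messages into parquet files:

```python
import io
import numpy as np
from kafka import KafkaProducer  # kafka-python; one of several possible clients

# Serialize the array in numpy's own .npy format so the consumer can rebuild it
# with dtype and shape intact.
arr = np.zeros((512, 512), dtype=np.uint8)
buf = io.BytesIO()
np.save(buf, arr)

# Hypothetical broker address and topic name.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("numpy-arrays", value=buf.getvalue())
producer.flush()  # block until the broker acknowledges the write
```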

------
reacharavindh
Not a database as you asked for, but I stumbled on this benchmark while looking
into ways to persist numpy arrays. You might find it interesting:
[https://github.com/mverleg/array_storage_benchmark/blob/master/README.rst](https://github.com/mverleg/array_storage_benchmark/blob/master/README.rst)

