
Ask HN: Database for storing machine learning data? - ThePhysicist
I'm looking for a database that can efficiently store and retrieve a very large number (billions) of structured datapoints for use in machine learning. Each datapoint can have an arbitrary number of categorical and numerical attributes and belong to one or more datasets.

I want to be able to quickly select datapoints of a given dataset (ideally within a few seconds for result sets of 1,000 to 1,000,000 datapoints) and possibly filter them based on their attribute values, e.g. formulating queries like "give me all datapoints belonging to dataset A for which x < 4.5 AND category = 'test' AND event_date >= '2009-04-10'". Once written, datapoints will not change, though I would like to attach additional information to specific datapoints (e.g. test results or additional labels); this could be done in a separate data structure or table.

Right now I'm solving this using a simple PostgreSQL database with auxiliary index tables, but I'm looking for more scalable alternatives.

I've considered software like Cassandra or ClickHouse, but I'm not sure they will fit my use case well. Do you have any recommendations, or have you built such a system in your work and can provide some ideas or guidance? Thanks!
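
To make the example query concrete, here is a minimal sketch of it as a parameterized SQL filter. It uses Python's built-in sqlite3 module purely as a stand-in for PostgreSQL so that it runs anywhere. The table name `datapoints`, its fixed columns, the index, and the sample rows are all assumptions for illustration; a real schema would also need to handle arbitrary attributes and membership in several datasets.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical datapoints table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datapoints ("
        " id INTEGER PRIMARY KEY,"
        " dataset TEXT NOT NULL,"
        " x REAL,"
        " category TEXT,"
        " event_date TEXT)"  # ISO-8601 dates sort correctly as plain strings
    )
    # Assumed composite index covering the equality and date-range filters.
    conn.execute(
        "CREATE INDEX idx_dataset_category_date"
        " ON datapoints (dataset, category, event_date)"
    )
    rows = [
        (1, "A", 3.2, "test", "2009-05-01"),
        (2, "A", 5.0, "test", "2009-05-01"),   # fails x < 4.5
        (3, "A", 1.0, "train", "2010-01-01"),  # wrong category
        (4, "A", 4.0, "test", "2008-12-31"),   # before the cutoff date
        (5, "B", 2.0, "test", "2009-06-15"),   # wrong dataset
        (6, "A", 4.4, "test", "2009-04-10"),
    ]
    conn.executemany("INSERT INTO datapoints VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def select_datapoints(conn, dataset, max_x, category, min_date):
    """Return ids of datapoints in `dataset` matching all three filters."""
    cur = conn.execute(
        "SELECT id FROM datapoints"
        " WHERE dataset = ? AND x < ? AND category = ? AND event_date >= ?"
        " ORDER BY id",
        (dataset, max_x, category, min_date),
    )
    return [row[0] for row in cur.fetchall()]


conn = build_demo_db()
matching_ids = select_datapoints(conn, "A", 4.5, "test", "2009-04-10")
print(matching_ids)
```

Whether an index like this stays fast at billions of rows is exactly the open question; the sketch only pins down the query shape.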
======
pachico
I use ClickHouse for analytical purposes, and on very modest hardware I managed
to ingest up to 5 million rows per second. I stopped there, but with more
parallel jobs I might achieve even more. Queries and exports are very fast too.
At the moment I cannot think of anything better for this. Let me know if you
need extra info.

------
tarun_anand
Define quickly. One minute, one hour?

What is the downstream use? To train, label?

~~~
ThePhysicist
I clarified that now, thanks. There are multiple use cases, e.g. selecting
datapoints for training or testing, as well as analyzing statistical properties
of specific datasets. So retrieving individual datapoints should be possible as
well. Once datapoints are written they will not get updated; if labeling takes
place, it would probably be done in another structure that gets linked to the
datapoint.
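
A minimal sketch of that idea, again using Python's built-in sqlite3 as a stand-in: datapoints are written once, and labels go into a separate table that references them by id. The table names `datapoints` and `labels` and the helpers `add_label` and `labels_for` are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE datapoints (id INTEGER PRIMARY KEY, x REAL);
    -- Labels live in their own table, so datapoint rows are never updated.
    CREATE TABLE labels (
        datapoint_id INTEGER NOT NULL REFERENCES datapoints(id),
        label TEXT NOT NULL
    );
    CREATE INDEX idx_labels_datapoint ON labels (datapoint_id);
    """
)
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?)", [(1, 0.5), (2, 1.5), (3, 2.5)]
)


def add_label(conn, datapoint_id, label):
    """Attach a label to a datapoint without touching the datapoint row."""
    conn.execute("INSERT INTO labels VALUES (?, ?)", (datapoint_id, label))


def labels_for(conn, datapoint_id):
    """Return all labels attached to a datapoint, in insertion order."""
    cur = conn.execute(
        "SELECT label FROM labels WHERE datapoint_id = ? ORDER BY rowid",
        (datapoint_id,),
    )
    return [row[0] for row in cur.fetchall()]


add_label(conn, 1, "positive")
add_label(conn, 1, "reviewed")
add_label(conn, 3, "negative")
print(labels_for(conn, 1))
```

Keeping labels append-only in their own table means the datapoint storage can stay immutable, which fits column stores that handle updates poorly.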

