
Ask HN: What are some of the most utilised patterns for querying large datasets? - extra_rice
I'm currently working on a software project where I need to query datasets that could be very large (maybe hundreds of thousands of records per single context), and then do some computations on the results. It's basically: find some sort of "median" from the set, though it could be a bit more complex than that, like finding the smallest, most common value. My impression is that most modern databases should be able to handle queries like this with some built-in mechanism. However, one concern is that, because the datasets could be very large, queries could end up taking very long. The data being queried is also highly dynamic, so caching may be a little tricky.

I'm pretty sure this isn't something unique to this project, but I'm interested to know how other practitioners address this kind of situation. Also, to note, while I'm asking this in general terms, it'd be interesting to hear how MongoDB users in particular handle this.
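For concreteness, the two computations described above can be sketched in plain Python (illustrative only, on a toy list standing in for one "context" of data, not the eventual database query):

```python
from collections import Counter
from statistics import median

values = [7, 3, 3, 9, 3, 7, 1]  # toy stand-in for one "context" of data

# Median of the set
print(median(values))  # -> 3

# "Smallest, most common value": find the highest frequency,
# then take the smallest value among those that reach it
counts = Counter(values)
top = max(counts.values())
smallest_most_common = min(v for v, c in counts.items() if c == top)
print(smallest_most_common)  # -> 3
```

The question is then how to get this kind of aggregate cheaply when `values` is hundreds of thousands of rows and changes constantly.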
======
bjourne
Have you heard of the inverted index? It is the cornerstone of information
retrieval systems and of full-text search in many databases. Your question is
quite fuzzy, so it is hard to come up with a more precise answer.
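A minimal inverted-index sketch in Python (toy documents with made-up contents), showing why lookups become cheap set intersections instead of full scans:

```python
from collections import defaultdict

docs = {
    1: "large datasets need good indexes",
    2: "query large datasets quickly",
    3: "indexes speed up query time",
}

# Build the inverted index: term -> set of document ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# A lookup is now a set intersection over the posting lists,
# not a scan over every document
def search(*terms):
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("large", "datasets"))  # -> {1, 2}
print(search("query", "indexes"))   # -> {3}
```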

~~~
extra_rice
Sorry, I didn't know how to make the question much clearer. Basically, it's:
how do you ensure that queries on very large, highly dynamic datasets return
in an acceptable amount of time (especially if clients call/poll them at
regular, short intervals)?

~~~
sethammons
Indexes, caching (pass-through, LRU, etc.), routing queries to read replicas,
sharding, pre-fetching, sampling; maybe look into columnar storage... Hard to
answer without knowing more specifics.

Something to always remember: if it is valuable, charge for it. If it is
really valuable, you can throw all kinds of hardware at it. Give each customer
their own dedicated instance, then rinse and repeat the strategies above.

------
snazzybazzy
It also depends on your query pattern. Are you fetching many columns/rows, or
are you looking for one particular row? And what do you mean by "large"? Are
we talking GBs, TBs, or bigger?

