
Ask HN: Keep Last X entries per ID in larger DB - matttah
I'm trying to figure out the best solution for the following:

* Daily import of ~500+ million rows of data, with ~250 million unique ids.
* I need to keep only the latest X entries per unique ID. Older entries are discarded once X entries for that id have been reached.
* Monthly, I will read out the entire dataset for processing.

X can be anywhere from 1000 to 3000; it is static over the entire DB and just depends on what we determine the best setting to be. Since I don't access the data more than once a day, or at the end of the month, I would prefer not to pay for live storage. There are over a billion unique ids, which I can partition by prefix or ranges. Each individual entry per ID is fairly small, with only an integer and two decimals stored.

What would you recommend as a data store for this?

Thanks!
======
prostoalex
Have a job queue running lower-priority DB queries. The queue workers select
all records for a given ID, then prune off older records above X.

Insert a fresh record immediately, since you know it’s recent. Upon successful
insert, fire off a queue request to go check on that ID.
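The insert-then-prune step can be sketched with SQLite; the schema and column names below are invented for illustration, and a real deployment would key the prune on whatever timestamp or sequence column the table actually has:

```python
import sqlite3

X = 3  # keep only the latest X entries per id (tiny here for illustration)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER, ts INTEGER, n INTEGER, d1 REAL, d2 REAL)")
conn.execute("CREATE INDEX idx_id ON entries (id)")

def insert_and_prune(conn, row):
    # Insert the fresh record immediately -- it is the newest by definition.
    conn.execute("INSERT INTO entries VALUES (?, ?, ?, ?, ?)", row)
    # Then prune everything older than the newest X for that id.
    conn.execute(
        """DELETE FROM entries
           WHERE id = :id AND ts NOT IN (
               SELECT ts FROM entries WHERE id = :id
               ORDER BY ts DESC LIMIT :x)""",
        {"id": row[0], "x": X},
    )

for ts in range(10):
    insert_and_prune(conn, (42, ts, 1, 0.5, 0.25))

print(conn.execute("SELECT COUNT(*), MIN(ts) FROM entries WHERE id = 42").fetchone())
# (3, 7) -- only the three newest entries survive
```

In the queued variant suggested above, the DELETE would run in the background worker rather than inline with the insert.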

~~~
matttah
At 500+ million records, even 0.01 seconds per lookup/update is going to be
too slow, even in parallel, unless the DB can process concurrent updates
without lock contention extremely quickly.

~~~
prostoalex
An index on ID should help.

If it doesn’t, then set up a DB replica (a good idea in any case), and do
SELECTs on the slave, DELETEs on the master.
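A plain index on the id column is what turns that per-id lookup from a full table scan into an index search. A minimal SQLite illustration (table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER, ts INTEGER, v REAL)")
conn.execute("CREATE INDEX idx_entries_id ON entries (id)")

# With the index in place, the planner does an index search, not a table scan.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM entries WHERE id = 7").fetchall()
print(plan[0][-1])  # e.g. "SEARCH entries USING INDEX idx_entries_id (id=?)"
```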

------
verdverm
Storage is cheap, analysis is expensive; it sounds like you might be
optimizing for the wrong variable. The data store of choice will likely
depend on your access patterns during analysis.

Flat files is my best guess for your data? HDFS

~~~
matttah
Agreed that storage is cheap, unless it's in a live cluster. Right now I keep
100% of the data in Redshift, then use window functions to unload the latest X
per id. I haven't tried keeping it pruned in real time, but right now unloading
the full dataset of the last X per id takes multiple days on a 6-node cluster.
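That "latest X per id" window query can be sketched in miniature with SQLite (which also supports ROW_NUMBER); the schema is invented and Redshift's UNLOAD syntax differs, but the partition/order logic is the same:

```python
import sqlite3

X = 2  # keep the latest X entries per id
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER, ts INTEGER, v REAL)")
rows = [(1, t, t * 0.5) for t in range(5)] + [(2, t, t * 1.5) for t in range(5)]
conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)

# Number rows newest-first within each id, then keep the first X.
latest = conn.execute(
    """SELECT id, ts, v FROM (
           SELECT id, ts, v,
                  ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
           FROM entries)
       WHERE rn <= ?
       ORDER BY id, ts""",
    (X,),
).fetchall()
print(latest)  # [(1, 3, 1.5), (1, 4, 2.0), (2, 3, 4.5), (2, 4, 6.0)]
```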

The analysis at end of month is simply give me 100% of the data set chunked up
by ID. All analysis is done outside of the system.

Flat files are my thinking right now, with S3 and a prefix per partition. I'm
not sure about the file format, since one requirement is being able to process
each day's data and update the existing data quickly. The current thought is:
load the current day's data to Redshift -> unload sorted by id -> process
concurrently. With multiple prefixes on S3 I won't hit the rate limits. My
main worry is that loading/parsing each file on read will take too long to be
scalable at 250-500 million unique ids per day. I wanted to check here before
going down that route to see if anyone had a different recommendation.
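Spreading writes across S3 prefixes can be as simple as deriving the prefix from a hash of the id. A sketch, where the key layout, bucket width, and ".bin" extension are all assumptions, not anything from the thread:

```python
import hashlib

def s3_key(record_id: str, day: str) -> str:
    # A hash-derived two-hex-char prefix (256 buckets) spreads ids across
    # many S3 prefixes, so per-prefix request-rate limits aren't hit by
    # hot spots in the id space.
    prefix = hashlib.md5(record_id.encode()).hexdigest()[:2]
    return f"{prefix}/{day}/{record_id}.bin"

print(s3_key("id-123", "2018-01-31"))
```

The same function applied at read time locates a given id's file without a lookup table.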

~~~
verdverm
Some things I might try...

    1. Hadoop / HDFS / Spark on an ephemeral cluster with disk snapshots
    2. Group 1M ids into a single file
    3. If analysis is once a month, save daily, then prep data right before analysis
    4. Consider using the Cassandra database
    5. Rent a big machine where the data can fit into memory
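Point 2, grouping ~1M ids into a single file, can be sketched as a simple range-bucketing function (the file-naming scheme here is made up):

```python
def bucket_for(numeric_id: int, ids_per_file: int = 1_000_000) -> str:
    # Group a contiguous range of ids into one file so that a billion ids
    # become ~1,000 files instead of a billion small objects.
    lo = (numeric_id // ids_per_file) * ids_per_file
    return f"ids_{lo}_{lo + ids_per_file - 1}.dat"

print(bucket_for(1_234_567))  # ids_1000000_1999999.dat
```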

------
ryanworl
BigQuery is probably the best price/effort ratio for this.

