Thank you! The thing I find tricky here is choosing the weight. I think maybe one obvious way you would want to weight samples would be for recency. E.g. if I have a table of user login events then I care about seeing more of the ones that happened recently but still want to see some of the older ones. Would the algorithm still work if I converted a `created_at` timestamp to epoch time and used that? Or would I want to normalize it in some way?
All of your `created_at` epoch times are going to be of similar magnitude, since they're all roughly the same distance from the start of the Unix epoch (1970-01-01T00:00:00 UTC). Recent timestamps are around 1.7 billion seconds, so even rows created years apart differ in weight by only a fraction of a percent. Using them directly as weights therefore gives you an essentially uniform sample: every row has about the same chance of being chosen.
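To put a number on that, here's a rough sketch in Python (the epoch value is a hypothetical "now", chosen just for illustration) comparing two rows created a day apart under both weighting schemes:

```python
EPOCH_NOW = 1_700_000_000      # hypothetical "now" in epoch seconds (~late 2023)
DAY = 86_400                   # seconds in a day

newer, older = EPOCH_NOW, EPOCH_NOW - DAY

# Raw epoch seconds as weights: the two rows are nearly interchangeable.
raw_ratio = newer / older
print(f"raw weight ratio:   {raw_ratio:.6f}")     # barely above 1

# Exponential decay with a 1-hour half-life: the newer row is 2**24 times
# more likely to be picked than the one-day-older row.
decay_ratio = 2.0 ** (DAY / 3600)
print(f"decay weight ratio: {decay_ratio:,.0f}")
```

The raw ratio is about 1.00005, i.e. a 0.005% preference for the newer row, which is why raw epoch weights behave like a uniform sample.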
Given what I glean about your scenario, I suspect you'd be better served by an exponentially decaying weight based on age (i.e., age = current_timestamp − created_at). For example, if you wanted each row to become half as likely to be selected for every additional hour of age, you could use `POW(2.0, -age / 3600)` as your weight, with age in seconds (a one-hour half-life):
duckdb> WITH Ages AS (
            SELECT UNNEST([0, 3600, 7200]) AS age
        )
        SELECT age, POW(2.0, -age / 3600) AS w FROM Ages;
┌───────┬──────┐
│  age  │  w   │
├───────┼──────┤
│     0 │ 1.0  │
│  3600 │ 0.5  │
│  7200 │ 0.25 │
└───────┴──────┘
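In case it helps to see the weight plugged into a complete sampler: one standard way to draw a weighted sample without replacement is the Efraimidis–Spirakis key trick (give each row the key `u**(1/w)` for uniform `u` and keep the k largest keys). The algorithm you're using may differ; this Python sketch, with hypothetical login-event rows, just shows the decay weight doing its job:

```python
import random

def weighted_sample(rows, k, now, half_life=3600.0, rng=random.random):
    """Pick k rows without replacement, favoring recent created_at values.

    Efraimidis-Spirakis: each row gets the key u**(1/w) for u ~ Uniform(0,1),
    and the k largest keys win. The weight w halves for every `half_life`
    seconds of age, matching POW(2.0, -age / 3600) from the SQL above.
    """
    def key(row):
        age = now - row["created_at"]
        w = 2.0 ** (-age / half_life)
        return rng() ** (1.0 / w)

    return sorted(rows, key=key, reverse=True)[:k]

# Hypothetical login events, one every 10 minutes; recent ids are small.
now = 1_700_000_000
events = [{"id": i, "created_at": now - i * 600} for i in range(100)]
print([e["id"] for e in weighted_sample(events, 5, now)])
```

Run it a few times and the sample is dominated by low ids (recent rows), with the occasional older row sneaking in, which sounds like the behavior you're after.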