
Show HN: Infreqdb – S3 backed key/value database for infrequent read access - sajal83
https://github.com/turbobytes/infreqdb
======
spullara
Be careful using S3 for lots of small writes. The price is utterly dominated
by the PUT calls which cost $0.01/1000.

~~~
sajal83
Yes. I should include this info in the README calcs. For my use case I intend
to do 4-10 PUTs per hour, since they are batched. 10 PUTs/hour ≈ 7,300
PUTs/month ≈ $0.073/month. The key here is _infrequent_.
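
Rough arithmetic behind that number, using the $0.01/1,000 PUT price quoted
above (actual pricing may vary by region):

    package main

    import "fmt"

    func main() {
        // Worst case from above: 10 batched PUTs per hour at $0.01 per 1,000 PUTs.
        putsPerMonth := 10.0 * 730 // ~730 hours in a month
        fmt.Printf("~%.0f PUTs/month = $%.3f/month\n", putsPerMonth, putsPerMonth*0.01/1000)
    }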

~~~
spullara
Be especially careful with retries. I saw a case where the bucket wasn't
writable and the code just retried in a loop. You get charged for the API
call, successful or not.
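
A minimal sketch of what I mean, assuming aws-sdk-go (putWithCap is a
hypothetical helper, not infreqdb code): cap the attempts instead of looping
forever, since every failed PUT is still billed.

    package s3util

    import (
        "bytes"
        "fmt"
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    // putWithCap retries a PUT a bounded number of times. An unwritable bucket
    // then costs at most maxAttempts billable requests instead of an endless loop.
    func putWithCap(svc *s3.S3, bucket, key string, body []byte) error {
        const maxAttempts = 3 // assumption: tune for your workload
        var err error
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            _, err = svc.PutObject(&s3.PutObjectInput{
                Bucket: aws.String(bucket),
                Key:    aws.String(key),
                Body:   bytes.NewReader(body),
            })
            if err == nil {
                return nil
            }
            // Every attempt is charged, successful or not, so back off and give up early.
            time.Sleep(time.Duration(attempt) * time.Second)
        }
        return fmt.Errorf("put s3://%s/%s failed after %d attempts: %v", bucket, key, maxAttempts, err)
    }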

------
marknadal
Stuff like this is great! And you mentioned you are primarily doing it for
time-series data, so you can easily batch writes to handle high-volume
throughput too!
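
For example (a hypothetical batcher, nothing from infreqdb): buffer incoming
points and flush them as one object per batch, so the PUT count stays small
even at high ingest rates.

    package batcher

    import (
        "sync"
        "time"
    )

    // Point is an illustrative time-series sample.
    type Point struct {
        TS    time.Time
        Value float64
    }

    // Batcher accumulates points and hands a full batch to flush (e.g. one S3 PUT)
    // when the buffer reaches maxSize; calling Flush on a timer covers slow periods.
    type Batcher struct {
        mu      sync.Mutex
        buf     []Point
        maxSize int
        flush   func([]Point)
    }

    func New(maxSize int, flush func([]Point)) *Batcher {
        return &Batcher{maxSize: maxSize, flush: flush}
    }

    func (b *Batcher) Add(p Point) {
        b.mu.Lock()
        b.buf = append(b.buf, p)
        full := len(b.buf) >= b.maxSize
        b.mu.Unlock()
        if full {
            b.Flush()
        }
    }

    // Flush uploads whatever has accumulated as a single batch.
    func (b *Batcher) Flush() {
        b.mu.Lock()
        batch := b.buf
        b.buf = nil
        b.mu.Unlock()
        if len(batch) > 0 {
            b.flush(batch)
        }
    }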

We did this with S3 as the storage engine, 100M+ records for $10/day:
[https://www.youtube.com/watch?v=x_WqBuEA7s8](https://www.youtube.com/watch?v=x_WqBuEA7s8)

And Discord had a very nice article on this as well:
[https://blog.discordapp.com/how-discord-stores-billions-of-m...](https://blog.discordapp.com/how-discord-stores-billions-of-messages-7fa6ec7ee4c7)

Great work, I think there is a lot of exciting stuff you can add to it!

~~~
sajal83
That's cool. Last week I tried googling for similar stuff but all I could find
was people asking "How to run postgres on S3"...

------
vtuulos
We use a similar approach with TrailDB. You can make S3 access totally
seamless with user-space page fault handling, which is pretty cool:
[http://tech.adroll.com/blog/data/2016/11/29/traildb-mmap-s3....](http://tech.adroll.com/blog/data/2016/11/29/traildb-mmap-s3.html)

~~~
sajal83
This is pretty cool. Do you invalidate all pages if the file changes upstream
on S3?

~~~
vtuulos
In our case all blobs in S3 are immutable, so no need to invalidate anything.
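
One easy way to get that immutability is to derive the key from the blob's
content, e.g. this tiny sketch (a hypothetical helper, not actual TrailDB
code): a modified blob goes out under a new key, so anything cached under the
old key can never go stale.

    package blobs

    import (
        "crypto/sha256"
        "encoding/hex"
    )

    // keyFor derives the S3 key from the blob's bytes. A modified blob gets a new
    // key, so copies cached under the old key stay valid and never need invalidation.
    func keyFor(prefix string, blob []byte) string {
        sum := sha256.Sum256(blob)
        return prefix + "/" + hex.EncodeToString(sum[:])
    }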

------
udkl
How is this different from AWS Athena? -
[https://aws.amazon.com/blogs/aws/amazon-athena-interactive-s...](https://aws.amazon.com/blogs/aws/amazon-athena-interactive-sql-queries-for-data-in-amazon-s3/)

Also, why don't you dump the log data into a NoSQL store like DynamoDB instead
of S3?

~~~
sajal83
Athena looks cool. Didn't know about it. It probably describes what I'm trying
to do.

> Also, why don't you dump the log data into a NoSQL store like DynamoDB
> instead of S3?

Price.

~~~
eropple
Maybe you should take another look at pricing? DynamoDB is like
$0.002/GB/month more expensive than S3 for storage. Requests are more
expensive, but you're also not reinventing every wheel, and if the requests
are as infrequent as this seems designed for, it's still only a few bucks a
month.

~~~
sajal83
My _infrequent_ reads and writes are huge, and possibly spiky.

> Write Throughput: $0.0065 per hour for every 10 units of Write Capacity
> (enough capacity to do up to 36,000 writes per hour)*

> A unit of Write Capacity enables you to perform one write per second for
> items of up to 1KB in size

As I understand it, for $4.68/month I can only add 36 MB/hour, and that's
assuming my objects are an exact multiple of 1KB.
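
Working through that arithmetic with the quoted prices (they may have changed
since):

    package main

    import "fmt"

    func main() {
        // Quoted figures: $0.0065/hour buys 10 write capacity units,
        // and one unit is a single 1 KB write per second.
        const (
            pricePerHour  = 0.0065    // for 10 WCUs
            writesPerHour = 10 * 3600 // 36,000 1 KB writes/hour = 36 MB/hour
            hoursPerMonth = 720       // ~30 days
        )
        fmt.Printf("$%.2f/month buys about %d KB/hour of writes\n",
            pricePerHour*hoursPerMonth, writesPerHour)
    }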

~~~
eropple
Ahh, gotcha. This setup makes a lot more sense if you're dealing with very
spiky reads.

------
derefr
So, a database that has a file format of write-once content-addressed
"shards" and is persisted/distributed by "offlining" those shards into object
storage, and then "onlining" them back into a given DB node's MRU cache at
query time; and is mostly read-only, but can update by downloading the
relevant shards, modifying them locally, re-hashing the modified shards to get
new names for them, and then uploading them again under those new names.
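
Roughly this read path, if I'm understanding it right (the names below are
mine, and fetchFromS3/lookupInShard are stand-in stubs, not infreqdb's API):

    package shardcache

    import (
        "fmt"
        "os"
    )

    // get serves a key from a shard that is already "online" in the local cache,
    // and "onlines" it from object storage first if it isn't.
    func get(cacheDir, shardID, key string) ([]byte, error) {
        local := cacheDir + "/" + shardID
        if _, err := os.Stat(local); os.IsNotExist(err) {
            if err := fetchFromS3(shardID, local); err != nil {
                return nil, fmt.Errorf("onlining shard %s: %v", shardID, err)
            }
        }
        return lookupInShard(local, key)
    }

    // Stubs for illustration only.
    func fetchFromS3(shardID, dest string) error         { return nil }
    func lookupInShard(path, key string) ([]byte, error) { return nil, nil }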

Is this basically equivalent to Datomic, then? (Not that that's a bad thing.
The world needs an open-source, non-JVM-targeted Datomic.)

~~~
reubano
> The world needs an open-source, non-JVM-targeted Datomic.

CouchDB?

------
ddxv
What do you think about this being able to work with gzip-compressed JSON
files stored on S3? Cool project, thank you.

~~~
sajal83
Thanks.

I think flat JSON files won't be efficient. My goal is to have the cache on
disk, and each cached file would be big, with lots of keys in it. In order to
use JSON files, I would either have to keep the whole parsed data in memory,
or parse the whole JSON each time I want to look up a key.

If the data fits in memory, then sure, JSON is more convenient.
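
For illustration, the flat-JSON approach would look something like this
(assuming a simple map-shaped document); every cold lookup pays for reading
and decoding the entire file:

    package jsonlookup

    import (
        "encoding/json"
        "os"
    )

    // lookupJSON reads one key out of a flat JSON file. Note the whole file has to
    // be read and parsed even though only a single value is wanted.
    func lookupJSON(path, key string) (json.RawMessage, bool, error) {
        data, err := os.ReadFile(path)
        if err != nil {
            return nil, false, err
        }
        var all map[string]json.RawMessage
        if err := json.Unmarshal(data, &all); err != nil {
            return nil, false, err
        }
        v, ok := all[key]
        return v, ok, nil
    }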

