
Ask HN: Using AWS S3 as a database - dazhbog
Hello,

I have a website that displays data from sensors. The backend is a Node.js server (doesn't really matter) and I have been trialling MongoDB and MySQL. I haven't been that happy with MySQL as it needs a lot of code to get it to play well with Node. On the other hand we have Mongo; it's nice but I have no idea how it will scale in the future. Plus, seeing how the HN community approaches Mongo, I am exploring other options :).

Anyway, to the question.

I have been thinking of using S3 as a database, and writing a Node.js driver with some basic query abilities. I am willing to accept a JSON request taking up to 500-700ms if it means scaling cheaply and easily to terabytes of data. I would add caching in the future, but for now reliable, scalable storage is what I am after.

I have done a preliminary benchmark between my server and S3:

* A file of 6.9MB takes 400ms on average (with spikes of many seconds on the first requests).

* A file of 400 bytes takes around <50ms, again with random spikes.

So, given the nice scaling nature of S3, would this be a good way of attacking the problem? What do you think?

I did research before posting :)

Cheers.
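A benchmark loop like the one described might be sketched roughly as below. `fetchObject` is a hypothetical stand-in for an S3 GET (with the real aws-sdk it would be an `s3.getObject` call against your bucket); here it just simulates a ~5ms request so the timing code can run locally:

```javascript
// Hypothetical stand-in for an S3 GET; simulates a ~5ms request.
function fetchObject(key) {
  return new Promise(resolve =>
    setTimeout(() => resolve(`{"key":"${key}"}`), 5));
}

// Time `runs` sequential GETs and report the average and worst case.
async function benchmark(key, runs) {
  const samplesMs = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    await fetchObject(key);
    samplesMs.push(Number(process.hrtime.bigint() - start) / 1e6);
  }
  const avg = samplesMs.reduce((a, b) => a + b, 0) / samplesMs.length;
  return { avg, max: Math.max(...samplesMs) };
}

benchmark('readings/1.json', 10).then(({ avg, max }) => {
  // Averages hide the spikes, so report the worst case too.
  console.log(`avg ${avg.toFixed(1)}ms, worst ${max.toFixed(1)}ms`);
});
```

Reporting the worst case alongside the average matters here, since the spikes are exactly what the OP is worried about.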
======
spotman
Hi,

It's not the worst idea, but here are some thoughts to consider:

* concurrency and locking

Databases are good at giving you a consistent view of data. Some are better at
it than others. Depending on your application this may be more or less
important to you. How often will your application have multiple requests in
flight at once? How would you deal with two requests writing to the same file
in S3? Which one will have the accurate representation of your data?

To give an example, let's say (step 1) request A comes in and pulls down some
JSON stored in S3. This request turns into a subsequent request (step 2) to
store the data back in S3 (let's say a value in the JSON needs to change).
Meanwhile, another session is started: request B, which also pulls down a copy
of the data. Let's say request B does this in between step 1 and step 2. Then,
on its own step 2, request B potentially overwrites data that you did not
intend to lose.
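That lost-update race can be sketched with a plain in-memory object standing in for the bucket (no real AWS calls; `getObject`/`putObject` here are hypothetical stand-ins for the SDK's `s3.getObject`/`s3.putObject`):

```javascript
// In-memory stand-in for an S3 bucket.
const bucket = { 'counters.json': JSON.stringify({ views: 10 }) };

const getObject = key => JSON.parse(bucket[key]);
const putObject = (key, doc) => { bucket[key] = JSON.stringify(doc); };

// Step 1: requests A and B both read the same version of the document.
const docA = getObject('counters.json');
const docB = getObject('counters.json');

// Step 2: each increments its own copy and writes it back.
docA.views += 1;
putObject('counters.json', docA);
docB.views += 1;
putObject('counters.json', docB); // silently overwrites A's write

// Two increments happened, but only one survives.
console.log(getObject('counters.json').views); // 11, not 12
```

A database would let you do this atomically (or at least detect the conflict); plain S3 PUTs give you last-writer-wins.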

* performance fluctuation

Having used S3 extensively, I can tell you that its performance varies. What
if your 400ms requests go through a 15-minute period where they become 4000ms
requests? Stranger things happen. S3 is an extremely reliable system in terms
of not losing your data and generally working all the time, but performance
varies. With each request that takes longer than you expect, resources in your
application (file descriptors, CPU, etc.) pile up. If your application can
sustain itself comfortably at these upper levels of latency you would be fine,
but they may trigger downstream impatience in the users of your application,
causing them to refresh data even more.

* querying and features

S3 is just a key-value store at the end of the day. Maybe that is enough for
your application. One day, however, if you decide you want to know how many
items in your database have a certain value, or how many were created on a
certain date, you have no option but to download the entire bucket iteratively
from S3.

S3 doesn't have a good way to even estimate the amount of data in it. Clients
that do this have to iterate over the contents of your bucket, and if one day
the data grows to 100GB, this is a slow task.
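To make the cost concrete, here is a sketch of what "count the items matching a value" turns into on S3. The fake `listPage` below stands in for the SDK's paginated list call (S3 returns at most 1000 keys per request, so callers must page through with a marker/continuation token), and each candidate object must then be fetched individually:

```javascript
// In-memory stand-in for a bucket of 2500 JSON objects.
const store = {};
for (let i = 0; i < 2500; i++) {
  store[`readings/${i}.json`] = JSON.stringify({ sensor: i % 3 });
}

// Stand-in for a paginated S3 list call: at most 1000 keys per page,
// resumed via a "start after this key" cursor.
function listPage(startAfter) {
  const keys = Object.keys(store).sort().filter(k => k > startAfter);
  return { keys: keys.slice(0, 1000), truncated: keys.length > 1000 };
}

// "SELECT count(*) WHERE sensor = 0" becomes: list every page,
// GET and parse every object.
let count = 0;
let pages = 0;
let cursor = '';
for (;;) {
  const page = listPage(cursor);
  pages += 1;
  for (const key of page.keys) {
    if (JSON.parse(store[key]).sensor === 0) count += 1; // one GET each
  }
  if (!page.truncated) break;
  cursor = page.keys[page.keys.length - 1];
}
console.log(`${pages} list pages, ${count} matches out of 2500 objects`);
```

With real network latency on every list page and every GET, this is exactly the slow full-bucket scan described above; a database would answer it from an index.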

In short, S3 tends to lend itself better to data that doesn't have as many
concurrency issues, or data that is mostly static. However, if there is a will
there is a way. If it were me, I would be most worried about the lack of
database-like features, and secondly about the concurrency issues, but it
does depend on your application.

Cheers

~~~
dazhbog
Excellent points, thanks for sharing your knowledge.

------
pjungwir
I've used S3 to store JSON objects. Two pain points I've noticed:

\- If your tests change the time (e.g. with Delorean for Ruby), S3 will fail
because the protocol depends on your client having approximately the same time
as the server.

\- If you ever want to load several S3 files at once, e.g. to show a list of
30 Foos, you'll need to make 30 requests. So this is a bit like an n+1
problem. There might be a way around this, but I haven't investigated it, and
it will probably require you to sidestep the abstractions you've built. I'd
say with 99.9% confidence you will want to do this someday.
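The n+1 pattern might look something like this sketch. `fakeGetObject` is a hypothetical stand-in for `s3.getObject`; with the real SDK each call is a full network round trip, and S3 has no batch-get:

```javascript
// Hypothetical stand-in for an S3 GET of one JSON object.
function fakeGetObject(key) {
  return Promise.resolve(JSON.stringify({ id: key }));
}

// Showing a list of 30 Foos means 30 separate GETs. Promise.all at
// least fires them concurrently instead of serially, but it is still
// one request per object.
async function loadFoos(ids) {
  const keys = ids.map(id => `foos/${id}.json`);
  const bodies = await Promise.all(keys.map(key => fakeGetObject(key)));
  return bodies.map(body => JSON.parse(body));
}

loadFoos([...Array(30).keys()]).then(foos => {
  console.log(`${foos.length} objects loaded via ${foos.length} requests`);
});
```

Concurrency hides some of the latency, but tail latency still dominates: the page is only as fast as the slowest of the 30 GETs.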

These days I mostly use Postgres instead of MySQL, but I can't help but think
that querying MySQL from Node has _got_ to be easier than building your own
database on top of S3.

~~~
dazhbog
I am also thinking of using Postgres; I think the support is better (now with
native JSON support). You still need to compile the SQL queries (in Node)
though, which is a pain. Another thing I remember: with a new DB you have to
get the tables set up and initialized, so more boilerplate code there :)

------
mak4athp
You don't realize it yet, but you're expressing a ton of anti-patterns here.
Not the least of which is an urge to prematurely optimize your project, and
desire to invent your own new solution to a broadly (but not universally)
solved problem.

Don't assume that you can anticipate where your scaling challenges are really
going to strike. MySQL and Mongo are both more than capable of supporting your
project for quite a while, and _after_ you collect more empirical data on your
project's growth and bottlenecks, you can start thinking about how to address
those problems.

~~~
dazhbog
Hi, thanks for your reply. I like your answer. I am aware of the programmer's
urge to _over-optimize "all the things"_ for the million requests per second.
I catch myself doing that many times.

My issue is that, while learning, you want to make the best choice when it
comes to the users' data. My take is that if _one_ does not fully understand a
component of the stack, _one_ takes precautions. For example, I might not be
the best at setting up a couple of MySQL servers doing replication and
monitoring the whole lot. However, what one can do is use a service like RDS,
which takes backups for you and lets you restore or spin up new instances,
etc. Which is nice.

Having switched to Mongo a while ago (using Compose), I feel that I don't have
that much control. So yeah, just experimenting I guess.

------
giaour
This could work, but it would make it impossible to query your data without
traversing every record in your DB.

If you want to store unstructured data but don't want to deal with scaling
pains in the future, try looking into hosted solutions.

\- Compose is pretty great for RethinkDB, Mongo, and ElasticSearch.

\- Amazon DynamoDB can scale like mad and supports secondary indexes,
arbitrary queries, and triggers (through Lambda).

~~~
dazhbog
Hi, thanks for your reply. I actually use Compose at the moment. I think it's
pretty good! Just being a third-party solution makes me slightly uneasy :)

------
seahorse
Try s3fuse. It saves you from writing that driver and from handling the
caching yourself.

[https://code.google.com/p/s3fuse/](https://code.google.com/p/s3fuse/)

------
czbond
Object store or database? The use cases you suggest are centered on storing
files. If that is your central use case, then S3 is designed for exactly that
and is perfect.

------
NeutronBoy
Given you're looking at using S3, is there a particular reason you aren't
using DynamoDB? Or even RDS?

~~~
dazhbog
Hi, I am actually using RDS for some other projects. I haven't used DynamoDB.
I am using mongo at the moment. Thanks

