

Ask HN: Help me decide on an architecture for building my startup - starterkit

I am planning to launch a startup, and it is a kind of social app. I am currently researching approaches to building the app, but there is too much to consider, so I am asking for your help in deciding on an architecture. (The knowledge below comes from my own research on the topic.)
My main concern is storing data.
- Use an RDBMS: PostgreSQL, because I need location-based features, so I may use PostGIS.
- Use Couchbase or Cassandra. Why? Because doing an activity stream with an RDBMS may be slow due to joins. In NoSQL, most of the time I will not be able to use joins, so I will design denormalized tables for activity streams (maybe I am wrong). Also, Couchbase seems fast, and it contains memcached inside, so a caching layer and a scaling layer will be there on install.
As far as I understand, if you use Couchbase you can retrieve data as JSON, so it will reduce the work of converting data to JSON on the backend app side, where I am going to use Python/Django.
- I need search, so I am going to use Elasticsearch. In that case Couchbase can be a good fit, since it has a plugin for ES, which may speed up development.
What would you recommend? (I have never created such a thing before, so my questions may seem too stupid to you.)
If you have any suggestions, please share them. Maybe I am thinking too much about performance and scaling, and maybe I should just stick with PostgreSQL at the beginning.
======
davismwfl
TL;DR: Couchbase is cool but has some considerations, so evaluate it;
Cassandra is better spoken to by someone with experience there; and
PostgreSQL is still great if you have a relational dataset. Start with what
you know if it works and go from there.

There is a huge number of factors that go into it, but I'll give you some
opinion. :)

If you know PostgreSQL and how to work with it today (and the others are new
to you), stay with it for now until you know it isn't the right tool. Trying
to learn a new methodology and tool while also starting a company and trying
to gain traction etc. isn't always the best way to go. Also, think about
hiring: when/if you need to, how long will it take to find someone with
experience in tool X, or to train someone on it?

As for Couchbase vs Cassandra vs PostgreSQL: all have their pros and cons,
and it will boil down to your use cases, dataset and complete tech stack (i.e.
some SDKs are less mature than others).

I have been a huge Couchbase fan and user for a few years now, going back to
membase. However, I'll be honest, while our current primary datastore is
Couchbase, we are moving away from it because of the amount of time we spend
solving issues that just shouldn't be. To get this out of the way, I love CB's
scale out ability and performance, it is stupid simple overall and works very
well -- Mongo could learn a few things about making the scale out process
easier from Couchbase (and I think they are). We also use Couchbase to
ElasticSearch, and it works pretty damn well, but again is still maturing. In
our recent evaluations we found we can eliminate 60-70% of our reasons for
using ES simply by moving off Couchbase. That means I can cut my ES resources
down to the 30-40% of use cases where it is needed and save some cash, while
still getting the same results and performance.

There are a number of things to consider when using CB as your datastore, and
while we are moving away from it, I think it is worth a solid look. However,
if you store a lot of documents that are small in size but that you want keyed
for near instant access, Couchbase can cause you to need far more machine
resources than you really should (i.e. it gets expensive fast). This is
because every key + metadata (56 bytes for 2.2 I believe) must be stored in
your bucket RAM, and once the key + metadata exceeds 50-60% of the available
RAM, you're in trouble in a few ways. So if you define the bucket to be 2 GB,
every key + metadata must fit within roughly 50% of that (1 GB). Of course, you can
keep scaling up/out to increase that size, but like I said costs start to
become a factor here. A fair rebuttal to that is to restructure the data so it
is larger values, smaller number of keys. However, now you run into a second
issue, while views are awesome we have seen they have quite a way to go to be
truly a final solution, and they have diminishing returns if you have too many
of them. So then the typical answer is you start merging views and returning
larger data sets and doing more and more work on the Couchbase client side
(API etc) to filter results. Not saying that is always bad, just something to
consider. Couchbase also limits you to no more than 10 buckets per cluster
(and in my experience, with more than 5 your CPU utilization goes up quite a
bit, so you generally need more CPU). That means if you need document
segmentation that is more than just a "type" field on a document, this can
quickly become an issue. Lastly, all of our APIs are in node.js, and frankly
CB's node library has a way to go before it is really ready to work in a high
transaction way. We have found that it leaks memory when you have sustained
high transaction volumes (this is with node 0.10.22), so we have reverted to
writing a lot of larger tasks directly in C to get around it; while I actually
enjoy doing that, it is time-consuming and not an efficient use of our
bootstrapped resources. I read a lot of what the CB team is doing and I think
they are working hard to fix almost every one of my points, so just weigh your
entire stack first. And please don't consider this a bash against CB, it is
anything but, as I think their technology is pretty damn cool, it just has to
fit your use case properly like any technology.
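
To put rough numbers on the key + metadata point above, here is a small
back-of-the-envelope sketch in Python. It only encodes the two figures quoted
in this comment (about 56 bytes of metadata per key, and a ceiling of roughly
50% of bucket RAM); both are recollections rather than official numbers, and
the function names and the 40-byte average key length are made up for
illustration.

```python
# Rough sizing sketch. The 56-byte metadata figure and the ~50% ceiling are
# the approximate values quoted in the comment above, not official numbers.
METADATA_BYTES = 56      # per-key metadata (quoted for Couchbase 2.2, unverified)
USABLE_FRACTION = 0.5    # share of bucket RAM that keys + metadata may occupy


def max_keys(bucket_ram_bytes: int, avg_key_bytes: int) -> int:
    """Estimate how many keys fit before key + metadata hits the ceiling."""
    budget = bucket_ram_bytes * USABLE_FRACTION
    return int(budget // (avg_key_bytes + METADATA_BYTES))


def ram_needed(num_keys: int, avg_key_bytes: int) -> int:
    """Estimate the bucket RAM (in bytes) needed to hold num_keys keys."""
    per_key = avg_key_bytes + METADATA_BYTES
    return int(num_keys * per_key / USABLE_FRACTION)


if __name__ == "__main__":
    two_gb = 2 * 1024**3
    # Hypothetical workload: keys averaging 40 bytes.
    print("keys that fit in a 2 GB bucket:", max_keys(two_gb, avg_key_bytes=40))
    print("GB needed for 100M keys:", ram_needed(100_000_000, 40) / 1024**3)
```

Under those assumptions a 2 GB bucket tops out at around 11 million small
keys before any document values are counted, which is why lots of tiny
documents get expensive.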

As for Cassandra, I am nowhere near an expert or even a good novice here, so
someone else can give you the good/bad there. I do know from reading that it
has grown in favor quite a bit and the redundancy and reliability are quite
good. We just evaluated it and felt it would be a good solution, however we
had a hard time fitting our use case into it. I fully admit that may be our
own limitations more than Cassandra's.

PostgreSQL is great, especially if you have the need for highly relational
data. In general, I still would favor an RDBMS if your dataset is highly
relational. So this depends more on what your data looks like and how it gets
used. Performance is good when designed right, but it is hard to reach the
performance of Couchbase, although everything has a trade-off. If I needed the
performance in places but my data was highly relational, I might look at using
Couchbase in front of the RDBMS as a persistent cache; this makes recovery
easier on the DB when there is a fault.
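
A minimal sketch of that "cache in front of the RDBMS" idea, in Python. To
keep it runnable anywhere, a plain dict stands in for the Couchbase bucket and
an in-memory SQLite database stands in for PostgreSQL; the table, key format
and function names are all hypothetical, and a real setup would use the
actual client libraries instead.

```python
import sqlite3

# Stand-ins, for illustration only: a dict plays the role of the Couchbase
# bucket and an in-memory SQLite database plays the role of PostgreSQL.
cache = {}
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
db.commit()


def get_user(user_id):
    """Serve from the cache if possible, otherwise fall back to the database."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]
    row = db.execute(
        "SELECT id, name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    user = {"id": row[0], "name": row[1]}
    cache[key] = user  # populate the cache for the next reader
    return user


def rename_user(user_id, name):
    """Write to the database first, then drop the stale cache entry."""
    db.execute("UPDATE users SET name = ? WHERE id = ?", (name, user_id))
    db.commit()
    cache.pop(f"user:{user_id}", None)
```

The relational database stays the source of truth, so the cache can be
rebuilt from it at any time; hot reads simply stop hitting the DB.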

In the end it's still all about your use case, dataset, tech stack and what you
need it to do.

~~~
starterkit
Thanks for the great explanation of Couchbase's disadvantages :). I have read
a lot about Couchbase and how good it is, but real production is never ideal,
so thanks.

~~~
davismwfl
Anytime. Good luck with everything.

------
acesubido
Design your stack around where the organization is growing from a business
perspective: early stage start-ups are very fast organizations. You have to
design your stack in a way, early on, that supports failed assumptions, which
means frequently changing requirements, while at the same time leaving enough
room to breathe and grow.

Scaling equates to specialization: you specialize in a specific area of your
stack because your startup has grown in that direction. At a very early stage
there's not much growth, only a period of intense validation, so don't
overthink scale right now.

If you are using PostgreSQL, just stick with it. You can ship things faster
and troubleshoot better with stuff you know. In terms of fetching data, you
can design it in a way that you actually don't have to do any joins at all.
Twitter still uses MySQL to this very day; they've customized the core
engine for their purposes. Point is: don't overthink storage for now;
no one knows yet which direction your startup will grow in :)
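
As one possible illustration of fetching data without joins, here is a sketch
of a denormalized activity feed that is written once per follower and read
from a single table. It uses Python's built-in SQLite so it runs anywhere; the
same table layout should carry over to PostgreSQL, but the schema, column
names and functions are invented for this example and are not taken from any
particular library.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
CREATE TABLE feed (
    owner_id   INTEGER,   -- whose feed this row belongs to
    actor_name TEXT,      -- copied in at write time, so no join to users
    verb       TEXT,
    object     TEXT,
    created_at INTEGER
);
CREATE INDEX feed_owner_time ON feed (owner_id, created_at DESC);
""")


def publish(actor_id, actor_name, verb, obj, ts):
    """Fan out on write: insert one feed row per follower of the actor."""
    followers = db.execute(
        "SELECT follower_id FROM follows WHERE followee_id = ?", (actor_id,)
    ).fetchall()
    db.executemany(
        "INSERT INTO feed VALUES (?, ?, ?, ?, ?)",
        [(f[0], actor_name, verb, obj, ts) for f in followers],
    )
    db.commit()


def read_feed(owner_id, limit=20):
    """Single-table read, newest first; no joins involved."""
    return db.execute(
        "SELECT actor_name, verb, object FROM feed "
        "WHERE owner_id = ? ORDER BY created_at DESC LIMIT ?",
        (owner_id, limit),
    ).fetchall()
```

The trade-off is more writes and duplicated data in exchange for cheap reads,
which is roughly what a NoSQL design would force on you anyway.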

Build and design a pleasurably usable RESTful HTTP API Server with a matching
client: in my experience this is very very helpful. At a very early stage,
building an API server allows you to pivot relatively quickly. When you have a
"proxy" for your database, it's practically developer-UX for fast-changing
business requirements, and it will avoid "database code hell", i.e. random
projects doing random things to your database.

Imagine you're building this huge web-app, but the users clearly want and need
a mobile equivalent. What if the users want some sort of on-site installation
for an enterprise version? Suddenly you're not a B2C startup anymore and
you're moving into B2B.

An API server helps you do tons of things that enable you to ship
applications faster, and it makes your startup very flexible since you can
isolate and maintain this very large part of your product.

~~~
starterkit
Thanks. By the way, as an experienced developer, what would your stack be if
you wanted to start your own? After reading the comments here and posts in
other places, I am going to stick with
Python/Django/PostgreSQL+PostGIS/ElasticSearch and django-tastypie for the
REST backend, and maybe django-activity-stream for the activity stream at the
beginning. Since I am the only team member for now :), I do not want to
reinvent the wheel.

------
alainkinwong
I can't comment on your storage stuff (we've had to develop our own to fit our
specific needs), but Elasticsearch is a good option to start out with for
search. You can check out the Elasticsearch case studies here:
[http://www.elasticsearch.org/case-study/](http://www.elasticsearch.org/case-study/)

~~~
starterkit
Can you tell me about yours, if possible, and what kind of product you are
building?

