

InfluxDB Clustering Design – Neither CP or AP - pauldix
http://influxdb.com/blog/2015/06/03/InfluxDB_clustering_design.html

======
ah-
Really interesting! Just to confirm I understood the basics correctly:

Write means append to a timeseries, which is keyed by timestamp and
identified by some name? If a write succeeds only partially, different servers
have different data. And a read might return any of these versions. After some
time the anti-entropy repair will kick in, and merge the diverging timeseries.
Merging means taking the union of all data points.
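The merge-by-union idea described above could be sketched roughly like this (a toy model for illustration, not InfluxDB's actual code; conflict handling at identical timestamps is an assumption here):

```python
# Toy sketch of anti-entropy merging: two diverged replicas of a
# timeseries are reconciled by taking the union of their points,
# keyed by timestamp. A retried write carrying the same client
# timestamp and value is naturally idempotent under this merge.

def merge_replicas(a, b):
    """Union of {timestamp: value} maps from two replicas.

    Assumption: on a timestamp collision the second replica wins
    (last-write-wins); the real system may resolve this differently.
    """
    merged = dict(a)
    merged.update(b)
    return merged

replica1 = {1000: 1.5, 1010: 2.0}
replica2 = {1000: 1.5, 1020: 2.5}   # diverged after a partial write
merged = merge_replicas(replica1, replica2)
```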

Where do the timestamps come from, the client? So if a client retries a
partially successful write, it'll have the same timestamp and will be merged
during repair. Are timestamps within a timeseries monotonically increasing?

The hinted handoff sounds like it is motivated by a similar problem that the
Kafka in sync replica set tackles. Do you have any views on the pros/cons of
your approach via ISR sets? I think Kafka uses ZK for the ISR management which
means it wouldn't work with the availability requirements of InfluxDB, but
could a modified version work?

So overall InfluxDB is sacrificing lots of consistency for availability. Since
the CP part of the system is actually cached, the entire system is really AP?
If not, what parts are not AP? Modification of the CP part, like creation of
new timeseries?

From a users perspective I could see it being useful to have a historical part
of the timeseries that's guaranteed to be stable, and an in-flux part where
the system hasn't settled yet. Then one could run expensive analytics on
the historical part, without having to recalculate everything on the next read
since the data could have changed since then. You're already hashing your data
and building a Merkle Tree, maybe that would make it possible to implement
something like that.

~~~
pauldix
Timestamps should mostly be supplied by the client. They can be in the present
or in the past; it doesn't matter.

If a write succeeds only partially, it will most likely be replicated to the
other servers (and thus become consistent) via the hinted handoff queue. This
should be a fast recovery. Anti-entropy is for some much longer term failure
that needs to be resolved.
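The hinted handoff mechanism described here could be sketched as follows (a simplified illustration under stated assumptions, not InfluxDB's implementation; the class and method names are hypothetical):

```python
from collections import deque

# Toy sketch of hinted handoff: writes destined for an unreachable
# replica are queued locally as "hints" and replayed once the node
# comes back, so short outages recover quickly without waiting for
# the longer-term anti-entropy repair.
class HintQueue:
    def __init__(self):
        self.hints = deque()

    def store_hint(self, node, point):
        # Remember a write that could not reach its replica.
        self.hints.append((node, point))

    def replay(self, send, up_nodes):
        # Deliver queued hints to nodes that are reachable again;
        # keep the rest queued for a later replay pass.
        remaining = deque()
        while self.hints:
            node, point = self.hints.popleft()
            if node in up_nodes:
                send(node, point)
            else:
                remaining.append((node, point))
        self.hints = remaining
```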

Our use of hinted handoff and our goals for that are just borrowed ideas from
Dynamo (the paper not the AWS DB), Cassandra, and Riak.

The issues around consistency are only for the failure cases. During normal
operation you'd see a consistent view (within a second or some small delta).

Mostly the system is AP, with some parts that are CP. But if you really
examine it, it's neither pure CP nor pure AP. It's some other thing.

~~~
marknadal
Present or past, what happens if the client's timestamp claims to be from the
future?

~~~
pauldix
That's fine too. However, queries by default set an end time of "now" on the
server. So depending on how far in the future the point is, it may not show up
in a query unless the end time is explicitly set to later than that point's
timestamp.
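This default-end-time behavior can be illustrated with a small sketch (the `now` parameter is fixed for illustration; the real server would use the current wall-clock time):

```python
def query(points, start, end=None, now=1000):
    # If no end time is given, the server caps the range at "now",
    # so points with future timestamps are silently excluded.
    end = end if end is not None else now
    return [(ts, v) for ts, v in points if start <= ts <= end]

points = [(900, 1.0), (950, 2.0), (1500, 3.0)]   # 1500 is "in the future"
query(points, start=0)            # future point excluded
query(points, start=0, end=2000)  # explicit end time includes it
```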

------
penprogg
What is the difference between this and Cassandra? A more powerful querying
language?

Cassandra already has Consistency Levels with Replication Strategies. I feel
like the only way to get powerful querying out of a system like this would be
to have a MapReduce layer on top of your DB, which is what many do to get
powerful querying from Cassandra.

~~~
pauldix
This is purpose built for this use case. The MapReduce system you talk about
is part of what's built in. Each aggregate function in the query language is
represented as a MapReduce job that gets run on the fly.
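The aggregate-as-MapReduce idea could be sketched like this for a mean() aggregate (a toy model of the concept; function names and shapes are assumptions, not InfluxDB's API):

```python
# Toy sketch: an aggregate like mean() split into a map step that
# runs per shard and a reduce step that combines the partials.
def map_mean(shard_points):
    # map step: each shard returns a partial (sum, count)
    return (sum(shard_points), len(shard_points))

def reduce_mean(partials):
    # reduce step: combine partial sums/counts into the final mean
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

partials = [map_mean([1.0, 2.0]), map_mean([3.0, 5.0])]
result = reduce_mean(partials)   # -> 2.75
```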

The other part of it is that we're optimized for this use case. I've built
"time series databases" on top of Cassandra before. It requires a great deal
of application level code and hacking to get things like retention policies
and continuous queries, which are built into InfluxDB.

------
pauldix
InfluxDB CEO and post author here. I'd love to hear feedback and answer any
questions.

~~~
schmichael
> Every server in the cluster keeps an in memory copy of this cluster
> metadata. Each will periodically refresh the entire metadata set to pick up
> any changes.

How does this avoid nodes doing stale reads from the in-memory copy, resulting
in each node having a slightly different, out-of-date view of the cluster?

~~~
lobster_johnson
I believe Raft is supposed to handle this; every modification is a log entry,
and every recipient has to ack the log entry, a bit like two-phase commit.

~~~
dcb18
This is incorrect, only reads from the current Raft master are guaranteed to
not be stale. In the case of InfluxDB, I think caching is safe because the
shard metadata is immutable.

~~~
schmichael
Where are you seeing that cluster metadata is immutable? I don't even know how
that would work. Surely nodes, databases, shards, users, permissions, etc. all
can change?

~~~
dcb18
Yep, you're right.

------
lucian1900
Looks quite nice and straightforward, but this is very clearly an AP system.
Most such systems use a CP component for cluster management.

~~~
pauldix
Can it be called a pure AP system if a long running network partition can
cause it to become unavailable?

In our case, shard groups are part of the data that is in the CP system. We
create these ahead of time, but those are what we use to determine where in
the cluster a given write should go and what servers we need to hit to answer
a query.
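The routing role of shard groups could be sketched roughly as follows (a simplified illustration; the data layout and hash choice are assumptions, not InfluxDB's actual scheme):

```python
import hashlib

# Toy sketch: a write's timestamp selects a shard group (a time range
# created ahead of time in the CP metadata), then a hash of the series
# key picks a shard/server within that group.
def route_write(series_key, timestamp, shard_groups):
    for group in shard_groups:
        if group["start"] <= timestamp < group["end"]:
            h = int(hashlib.md5(series_key.encode()).hexdigest(), 16)
            return group["shards"][h % len(group["shards"])]
    # With no covering shard group (e.g. during a long partition),
    # the write cannot be placed -- the unavailability case above.
    raise KeyError("no shard group covers timestamp %d" % timestamp)

groups = [
    {"start": 0,   "end": 100, "shards": ["server-a", "server-b"]},
    {"start": 100, "end": 200, "shards": ["server-c", "server-d"]},
]
route_write("cpu,host=1", 150, groups)   # one of server-c / server-d
```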

Let's call a "normal partition" one that is less than a few hours. In that
case I wouldn't expect it to cause the system to become unavailable. However,
there are certainly scenarios in which a longer partition would make it
unavailable.

And even for a "normal partition", new databases wouldn't be available to
write to.

------
seaworthy-tonya
> Being able to write and query the data is more important than having a
> strongly consistent view

I can imagine some use cases where this is a very reasonable assumption
(statsd-style analysis & monitoring systems) but other cases where it's not so
great (financial systems).

~~~
pauldix
That's why later on we'll be adding per-query request consistency levels. For
now, we're focused on the AP use case where you don't need an absolute
guarantee.

------
chaotic-good
What is throughput of the system per node?

~~~
pauldix
It hasn't been measured yet. That'll come in mid-to-late summer after we've
released 0.9.1 or 0.9.2.

Also, on the 0.9.2 or 0.9.3 release cycle we're going to start work on a new
storage engine that is custom built for this use case. That will have a
massive effect on the throughput.

~~~
chaotic-good
Can I read about this storage engine somewhere?

------
hendzen
How is this different than hadoop?

~~~
Xorlev
I'll assume that you're asking out of a genuine curiosity born from a lack of
knowledge of either system and not a question of "why was this made?"

Hadoop is a computing ecosystem. The Hadoop project is not only a computing
framework, but it's a datacenter work scheduler (YARN), a distributed
filesystem (HDFS), a computing framework (MapReduce), a database built on top
of HDFS (HBase), and a whole host of other complementary technologies. Admittedly,
when most people say "Hadoop", they refer to MapReduce. MapReduce is a batch
computation framework principally for executing filtering or aggregation over
large amounts of data (e.g. finding top referrers from request logs).

InfluxDB is a distributed timeseries database. The closest analogue in the
Hadoop ecosystem would be HBase running OpenTSDB. InfluxDB is aiming to fill
the niche of high-volume metric collection and analysis. A system like
InfluxDB (or any other time-series storage solution) aims to observe data over
time for use in dashboarding, alerting, and general analysis over time. For
example, tracking pageviews per second or response times.

I encourage you to take a look at all these projects; they're fantastic when
you need them.

