
A Distributed Real-Time Data Store with Flexible Deduplication - prospero
https://amplitude.com/blog/2017/01/18/nova-distributed-real-time-data-store/?a=b
======
alexatkeplar
There are other reasons for duplicates in event streams, not just the dupes
introduced by at-least-once processing in Kinesis or Kafka workers. We've done
a lot of thinking about this (all open-source) at Snowplow; this is a good
starting point:

[http://snowplowanalytics.com/blog/2015/08/19/dealing-with-
du...](http://snowplowanalytics.com/blog/2015/08/19/dealing-with-duplicate-
event-ids/)

Our last release started to tackle dupes caused by bots, spiders and dodgy
UUID algos:

[http://snowplowanalytics.com/blog/2016/12/20/snowplow-r86-pe...](http://snowplowanalytics.com/blog/2016/12/20/snowplow-r86-petra-
released/#synthetic-dedupe)

~~~
andrsncpr
Hi, Jin here from Amplitude. You are absolutely right that there are other
sources of duplicates. Our real-time data store sits behind an event processor
(not covered in this blog post) that handles all major event duplication
scenarios. This is why the real-time store focuses on duplicates introduced by
message bus replays, something that systems such as Druid do not address.

------
csears
I would be curious to know if they evaluated any cloud-based data stores or
streaming services from AWS or GCP before deciding to build this from
scratch. It seems like a common set of requirements for event analytics
pipelines.

~~~
andrsncpr
Hi, Jin here from Amplitude. The real-time data store is part of a bigger
columnar store we built last year called Nova
([https://amplitude.com/blog/2016/05/25/nova-architecture-
unde...](https://amplitude.com/blog/2016/05/25/nova-architecture-
understanding-user-behavior/)). In designing Nova, we looked at many
existing solutions including Amazon Redshift and Google BigQuery, but none of
them sufficiently supported all our use cases. You can read more in the linked
blog post.

~~~
tlarkworthy
I read that, and your motivations for building Nova align very well with
BigQuery, e.g. immutability (BigQuery was append-only) and flexibility (break
out of SQL with Dataflow).

------
julienmarie
It reminds me exactly of the common architecture pattern of KDB/Q. Even at
this point, it's a marvel of tech.

