

Ask HN: designing for interactive querying of large-scale high dimensional data - decimusphostle

Hello Folks,

I am looking to get some input on the architectural design of data storage and processing for something new I am working on.

I have large-scale datasets (read: a few hundred TB) which are used to generate data to power analytics. For the most part the analytics are data aggregations along different dimensions, computed offline (think Hive/Pig/Hadoop) and then fed into and stored in an RDBMS or an in-memory data store. This allows for fairly low-latency, highly responsive 'online interactions', either via a web page or an API.

However, the new features being contemplated involve allowing users to slice and dice the data along various new dimensions. Previously, since the dimensionality was low, the batch jobs would compute rollups of the cartesian products of various combinations of these dimensions. The 'online interactions' could then be reduced to a mere lookup operation. This no longer seems viable with increased dimensionality and high cardinality.

So, to get to the question: does anyone have any suggestions on new/interesting data architectures/data models/solutions suited for slicing and dicing large-scale datasets along different dimensions while keeping interactivity high and latencies low?

Looking forward to hearing from the resident data gurus. TIA.

PS:

1) HN n00b here. Apologies if I am doing anything wrong.
2) An example is provided in the comments (here: https://news.ycombinator.com/item?id=7770551).
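The rollup-then-lookup pattern described above can be sketched in a few lines. This is a hypothetical illustration (the dimension names and event shape are invented for the example, not taken from any real system): an offline batch job aggregates raw events into one value per cell of the dimension cross-product, and the online path becomes a single O(1) lookup.

```python
# Hypothetical sketch of the old "precompute rollups offline, look up online"
# approach. A plain dict stands in for the RDBMS / in-memory store.
GENDERS = ["male", "female", "unknown"]
AGE_BUCKETS = ["18-24", "25-34", "35-44", "45-54", "55+"]
DEVICES = [f"device_{i}" for i in range(15)]

def batch_rollup(events):
    """Offline (think Hive/Pig/Hadoop): aggregate raw events into one
    value per (gender, age, device) cell."""
    rollup = {}
    for gender, age, device, pageviews in events:
        key = (gender, age, device)
        rollup[key] = rollup.get(key, 0) + pageviews
    return rollup

def online_lookup(rollup, gender, age, device):
    """Online: a plain O(1) lookup, hence the low latency."""
    return rollup.get((gender, age, device), 0)

events = [("male", "25-34", "device_3", 10),
          ("male", "25-34", "device_3", 5)]
rollup = batch_rollup(events)
print(len(GENDERS) * len(AGE_BUCKETS) * len(DEVICES))       # 225 cells max
print(online_lookup(rollup, "male", "25-34", "device_3"))   # 15
```

The batch job bounds the online work: no matter how many raw events there are, the store never holds more than one row per dimension cell.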
======
decimusphostle
Example/Fabricated scenarios below

A simplistic view of the old scenario: assume something like Google Analytics
that tracks traffic data for different websites. Previously a user could
filter various website stats (unique users, time spent by users on a site,
number of pageviews per user, et al.) by gender, by age, by device. So for
example, interact with the data using a filter that picks all males (one of 3
gender options), in one age bucket (assume 5 different age buckets), using a
specific device category (assume 15 different device categories). Here the
cartesian product becomes 3 (genders) x 5 (age buckets) x 15 (device
categories), which makes it a maximum of 225 data points per day, per website
tracked.

And now the new scenario: besides the number of dimensions going up, i.e. more
multipliers in the cartesian product above, the way a filter works is being
tweaked as well. Previously it was an AND across different dimensions (e.g.
all males AND 25-34 AND Windows Tablet users); now it is an AND between
dimensions and an OR within a dimension (e.g. all males AND [25-34 OR 35-44 OR
45-54] AND [Windows Tablet OR Windows Desktop] AND [some fourth dimension
option 1 OR some fourth dimension option 2]...). Since a filter can now pick
any non-empty subset of each dimension's values, the number of distinct
filters is (2^3 - 1) x (2^5 - 1) x (2^15 - 1), roughly 7.1 million even for
the old three dimensions. That combinatorial explosion rules out precomputing
every filter combination.
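A quick back-of-the-envelope check of the numbers above, assuming the dimensions from the example (3 genders, 5 age buckets, 15 device categories):

```python
# Old scheme: one value per dimension, so the filter space is the plain
# cartesian product of the dimension sizes.
old_combinations = 3 * 5 * 15
print(old_combinations)  # 225

# New scheme: OR within a dimension means a filter selects any non-empty
# subset of each dimension's values, so each dimension of size n
# contributes 2**n - 1 choices.
new_combinations = (2**3 - 1) * (2**5 - 1) * (2**15 - 1)
print(new_combinations)  # 7110439 -- far too many to precompute

# Note the base cells are unchanged: any subset filter can still be
# answered by summing its matching cells out of the same 225 rollups.
```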

~~~
cnasillo
Check the following:

GoSquared is a web analytics app that covers exactly the sort of scenarios you
mentioned.

In brief, they use Cassandra and Redis for their backend. Redis is their
middle layer for fast writes, sorting, and aggregations (hourly, daily,
weekly, monthly), while Cassandra is their "persistence" layer, where they
store the aggregations as blobs. Within Cassandra they optimize heavily for
data locality, enabling fast lookups of 'contiguous' rows via partitioning
and clustering keys.

When they need to pull the aggregations back, they scan the key ranges and
pull the blobs from Cassandra, then push them into Redis, which, with its
various built-in data types, does a lot of the heavy lifting.
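A minimal sketch of that storage pattern, with everything invented for illustration (the key layout and names are my own guess at the shape of such a schema, not GoSquared's actual one): daily rollup blobs keyed by (site_id, day), kept in sorted-key order so a day range for one site can be scanned as contiguous rows, the way a Cassandra clustering-key slice would be.

```python
import json
from bisect import bisect_left, bisect_right

# Hypothetical sketch of a Cassandra-style layout: partition key = site_id,
# clustering key = day, value = serialized rollup blob. A sorted key list
# stands in for Cassandra's on-disk ordering of contiguous rows.
store = {}          # (site_id, day) -> JSON blob
sorted_keys = []    # kept sorted so ranges can be scanned contiguously

def write_rollup(site_id, day, rollup):
    """Persist one day's aggregation as a blob."""
    key = (site_id, day)
    if key not in store:
        sorted_keys.insert(bisect_left(sorted_keys, key), key)
    store[key] = json.dumps(rollup)

def scan_range(site_id, start_day, end_day):
    """Pull blobs for one site across a day range in a single ordered
    pass, mimicking a clustering-key slice query."""
    lo = bisect_left(sorted_keys, (site_id, start_day))
    hi = bisect_right(sorted_keys, (site_id, end_day))
    return [json.loads(store[k]) for k in sorted_keys[lo:hi]]

write_rollup("site-1", "2014-05-01", {"pageviews": 120})
write_rollup("site-1", "2014-05-02", {"pageviews": 95})
write_rollup("site-2", "2014-05-01", {"pageviews": 40})

week = scan_range("site-1", "2014-05-01", "2014-05-07")
print(sum(r["pageviews"] for r in week))  # 215
```

In the real system the merging and re-aggregation of the pulled blobs would happen in Redis rather than in application code, but the key-range scan is the part that depends on the data locality described above.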

You should watch the following presentation from them:

[https://engineering.gosquared.com/video-databases-tech-gosquared](https://engineering.gosquared.com/video-databases-tech-gosquared)

