
Ask HN: How to handle 50GB of transaction data each day? (200GB during peak) - NietTim
Recently I switched jobs and joined some old employers and friends on a new endeavour, an advertisement platform. So far things have been very good, so good in fact that we're quickly growing out of our current infrastructure. Last Christmas everything went crazy and the then-only developer rewrote a bunch of things to make it scale better.

Currently we write every single transaction to txt files and once every X minutes gather those and start inserting them into 2 MySQL shards. After that we run a bunch of GROUP BY queries to generate cross-table reports. Currently our CMS is very limited by this: you can only view data relations in the predefined tables, and we would like to have (a lot) more flexibility here.

Perhaps the biggest issue of all is that neither I nor the other developer has experience with handling this much data. We have both worked on systems which handle a fair bit of data, but not nearly as much as this. We've been looking at MapReduce for MongoDB but already run into problems when we try to insert a week's worth of data, which is about 350GB, because we only get so many IOPS at AWS. If we want more IOPS, it gets really expensive really fast. We're also not entirely sure yet that MongoDB and MapReduce are the way to go.

What I'm looking for is a few pointers to help us get headed in the right direction. Which databases should we be looking at? Should we be looking in an entirely different direction? Are there some (recent?) blog posts about this which can help us out?
======
ecaroth
Not an answer to your question, but just a quick note- this is the first post
in a long while on HN where I appreciate both the problem you are looking to
solve and the honesty/sincerity you have in saying that you are not perfectly
qualified to solve it but you know those here can help. From all of us on the
community watching and lurking, thanks for your candor so we can all learn
from this thread!

~~~
faizshah
I think shopping around and recommending/evaluating these tools is pretty fun;
that's why these threads are always a blast.

------
haddr
First of all, 50GB per day is easy. Now, maybe contrary to what they say
below, do the following:

* Don't use queues. Use logs, such as Apache Kafka. You are unlikely to lose any data, and in case of a failure the log of transactions is still there for some time. Kafka also guarantees the order of messages, which might be important (or not).

* Understand the nature of your data and the queries that will be made later. This is crucial for properly modelling the storage system.

* Be careful with the NoSQL Kool-Aid. If mature databases such as PostgreSQL can't handle the load, choose a NoSQL option, but be careful. I would suggest HBase, but your mileage may vary.

* NoSQL DBs typically limit the queries that you can issue, so the modelling part is very important.

* Don't index data that you don't need to query later.

* If your schema is relational, consider de-normalization steps. Sometimes it is better to replicate some data than to keep a relational schema and make huge joins across tables.

* Don't use MongoDB

I hope it helps!

~~~
late2part
Hear, hear; +1.

Kafka = good. Postgres can do a TB; if you plan bigger than that consider
Cassandra.

Also consider protobuf instead of json.gz.

~~~
stuartaxelowen
Or... Arrow?! :O

JK it's too early :(

------
mattbillenstein
First of all, ingest your data as .json.gz -- line delimited json that's
gzipped -- chunk this by time range, perhaps hourly, on each box. Periodically
upload these files to the cloud -- S3 or Google CloudStorage, or both for a
backup. You can run this per-node, so it scales perfectly horizontally. And
.json.gz is easy to work with -- looking for a particular event in the last
hour? gunzip -c *.json.gz | grep '<id>' ...
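A minimal sketch of that per-node writer in Python (the directory layout, file names, and the `ts` event field are assumptions for illustration):

```python
import gzip
import json
import os
import time

def append_event(event, out_dir="events"):
    """Append one event as a line of JSON to the current hourly .json.gz chunk."""
    os.makedirs(out_dir, exist_ok=True)
    hour = time.strftime("%Y-%m-%d-%H", time.gmtime(event.get("ts", time.time())))
    path = os.path.join(out_dir, f"{hour}.json.gz")
    # gzip supports append mode; each write adds a valid gzip member
    with gzip.open(path, "at") as f:
        f.write(json.dumps(event) + "\n")

def find_events(pred, out_dir="events"):
    """The gunzip | grep step: scan every chunk, yield matching events."""
    for name in sorted(os.listdir(out_dir)):
        with gzip.open(os.path.join(out_dir, name), "rt") as f:
            for line in f:
                event = json.loads(line)
                if pred(event):
                    yield event
```

The upload-to-S3/GCS step would then just ship each closed hourly file, which is what makes this scale per-node.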

Most of the big data tools out there will work with data in this format --
BigQuery, Redshift, EMR. EMR can do batch processing against this data
directly from s3 -- but may not be suitable for anything other than batch
processing. BigQuery and/or Redshift are more targeted towards analytics
workloads, but you could use them to saw the data into another system that you
use for OLAP -- MySQL or Postgres probably.

BigQuery has a nice interface and it's a better hosted service than Redshift
IMO. If you like that product, you can do streaming inserts in parallel to
your gcs/s3 uploading process for more real-time access to the data. The web
interface is not bad for casual exploration of terabytes of raw data. And the
price isn't terrible either.

I've done some consulting in this space -- feel free to reach out if you'd
like some free advice.

~~~
devbug
Sounds like you know your stuff!

I couldn't find any email on your profile, do you mind sharing it?

~~~
chejazi
Bottom of this page: [http://vazor.com/unicode/](http://vazor.com/unicode/)

------
nunobrito
We need to handle data at a similar level to what you mention, and also use
plain text files as the only reliable medium for storing data. A recent blog post:
[http://nunobrito1981.blogspot.de/2016/02/how-big-was-triplec...](http://nunobrito1981.blogspot.de/2016/02/how-big-was-triplecheck-tech-in-2015.html)

My advice is to step away from AWS (because of price, as you noted). Bare-metal
servers are a startup's best friend for large data in regards to performance
and storage. This way you avoid virtualized CPUs and distributed file systems
that are more of a bottleneck than an advantage.

Look for GorillaServers at
[https://www.gorillaservers.com/](https://www.gorillaservers.com/)

You get 40TB of storage with 8-16 cores per server, along with 30TB of
bandwidth included, for roughly 200 USD/month.

This should remove the IOPS limitation and provide enough working space to
transform the data. Hope this helps.

~~~
look_lookatme
Are you using GorillaServers solely for data harvesting/crunching or for all
of your infrastructure?

------
harel
Here are a few suggestions based on 6+ years in adtech (which has just come
to a close; never again, thank you):

* Use a queue. RabbitMQ is quite good. Instead of writing to files, generate data/tasks on the queue and have them consumed by more than one client. The clients should handle inserting the data into the database. You can control the pipe by the number of clients consuming tasks, and/or by rate limiting them. Break those queue-consuming clients into small pieces. It's OK to queue item B while processing item A.

* If your data is more fluid and changes all the time, and/or comes in a JSON-serializable format, consider switching to PostgreSQL 9.4+ and using JSONB columns to store it. You can index/query those columns, and performance-wise it's on par with (or surpasses) MongoDB.

* Avoid AWS at this stage. As someone here commented, bare metal is a better friend to you. You'll also know exactly how much you're paying each month; no surprises. I can't recommend SoftLayer enough.

* Don't overcomplicate things. If you can think of a simple solution to something, it's preferable to the complicated solution you might have had before.

* If you're going the queue route suggested above, you can pre-process the data as it comes in. If it's going to be placed into buckets, do it then; if it's normalised, do it then. The tasks on the queue should be atomic and idempotent. You can use something like memcached if you need your clients to communicate with each other (like checking that a queue item is not already being processed by another consumer and thus locked).
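The idempotent-consumer point in that last bullet can be sketched like this (Python; a plain set stands in for memcached, and a list stands in for the database insert):

```python
class Deduper:
    """Stands in for memcached: remembers which task ids are done/locked."""
    def __init__(self):
        self._seen = set()

    def acquire(self, task_id):
        # In real memcached this would be an atomic add() with a TTL,
        # so a crashed consumer's lock eventually expires.
        if task_id in self._seen:
            return False
        self._seen.add(task_id)
        return True

def consume(task, db, dedup):
    """Process one queue task: skip if already handled, else insert."""
    if not dedup.acquire(task["id"]):
        return False  # another consumer already has this item
    db.append(task["payload"])  # stands in for the real DB insert
    return True
```

Because replays are skipped, redelivering a task after a consumer crash is harmless, which is what makes the pipeline safe to scale by adding consumers.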

~~~
unknownknowns
> * If your data is more fluid and changes all the time, and/or comes in a
> JSON-serializable format, consider switching to PostgreSQL 9.4+ and using
> JSONB columns to store it. You can index/query those columns, and
> performance-wise it's on par with (or surpasses) MongoDB.

Would it be worthwhile to switch from MongoDB to Postgres (when MySQL is being
used alongside Mongo)? I'd have to do lots of testing with it to know for sure
though, I guess.

I use Mongo for timeseries-like data, and have >1 billion documents in Mongo
currently. Querying super old data is pretty slow (and not needed generally),
but querying recent data (recent being the last month) is "fast enough".

~~~
harel
If your setup makes sense for you, it's fine; stay with it. Focus on real
problems, not ones people on this thread think you have :)

I'm not one of those Mongo bashers. I've used Mongo before for the same
purpose you do (time-series statistical data alongside a Postgres DB for
everything else), and it worked fine for me. However, at that time PostgreSQL
was pre-9.4. Today I would keep it all in Postgres and simply archive
older data if needed. And to be fair, it's not even a "problem" until you need
to filter by one data source and sort by the other. We had to be creative when
we got to those problems.

As for my choice of Postgres over MySQL, that is for two reasons: first,
Postgres is not owned by Oracle; second, Postgres is more mature,
feature-rich and reliable than MySQL (i.e., better).

------
TheIronYuppie
Disclaimer: I work at Google.

Have you looked at Google at all? Cloud Bigtable runs the whole of Google's
Advertising Business and could scale per your requirements.

[https://cloud.google.com/bigtable/docs/](https://cloud.google.com/bigtable/docs/)

~~~
ukd1
This is super easy to get started with, plus really cheap storage wise. Query
wise, it can get "expensive" (tbh, still cheap) if you do it badly. It's
really fast.

------
lazyjones
I'm not sure I understand precisely what kind of data you are processing and
in what way, but it sounds like a PostgreSQL job on a beefy server (lots of
RAM) with SSD storage. Postgres is very good at complex queries and concurrent
write loads, and if you need to scale quickly beyond single-server setups, you
can probably move your stuff to Amazon Redshift with little effort. I wouldn't
recommend "big data", i.e. distributed, setups at that size yet unless your
queries are extremely parallel workloads and you can pay the cost.

In my previous job we processed 100s of millions of row updates daily on a
table with much contention and ~200G size and used a single PostgreSQL server
with (now somewhat obsoleted by modern PCIe SSDs) TMS RamSAN storage, i.e.
Fibre-Channel based Flash. We had some performance bottlenecks due to many
indexes, triggers etc. but overall, live query performance was very good.

------
zengr
Doing real-time querying for report generation on data growing by 50GB per day
is a hard problem.

Realistically, this is what I would do (I work on something very similar,
though not really in the adtech space):

1. Load the data in text form (assuming it sits in S3) into Hadoop (EMR/Spark).

2. Generate the reports you need based on your data and cache them in MySQL RDS.

3. Serve the pre-generated reports to your users. You can get creative here
and generate bucketed reports so users will feel it's more "interactive".
This approach will take you a long way, and when you have time/money/people,
maybe you can try getting fancier and better.
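Step 2, pre-generating and caching reports, might look roughly like this (Python sketch; a dict stands in for the MySQL RDS cache, and the grouping keys and field names are assumptions):

```python
from collections import Counter

def build_report(events, keys=("date", "campaign")):
    """Reduce raw events to per-bucket counts: the 'pre-generated report'."""
    report = Counter()
    for e in events:
        report[tuple(e[k] for k in keys)] += 1
    return report

# The cache would be MySQL RDS in practice; here, a dict keyed by report name.
cache = {}

def serve(name, events):
    """Serve a cached report; regenerate only when missing (or on a schedule)."""
    if name not in cache:
        cache[name] = build_report(events)
    return cache[name]
```

Serving only these pre-aggregated rows is what keeps the user-facing queries cheap regardless of how much raw data arrives.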

Getting fancy: if you truly want near-real-time querying capabilities, I would
look at Apache Kylin or LinkedIn's Pinot. But I would stay away from those for
now.

Bigtable: as someone pointed out, Bigtable is a good solution (although I
haven't used it), but since you are in the AWS ecosystem, I would stick there.

------
wsh91
We're having a good time with Cassandra on AWS ingesting more than 200 GiB per
day uncompressed. I don't know how you're running your IOPS numbers, but
consider allocating large GP2 EBS volumes rather than PIOPS--you'll get a high
baseline for not that much money. The provisos you'll see about knowing how
you expect to read before you start writing are absolutely true. :)

(Hope that might be helpful! A bunch of us hang out on IRC at #cassandra if
you're curious.)

------
yuanchuan
I once worked on a similar project. Each day, about 5TB of data came in.

If your data are event data, e.g. user activity, clicks, etc., they are non-
volatile data which should be preserved as-is; you'll want to enrich them
later for analysis.

You can store these flat files in S3 and use EMR (Hive, Spark) to process them
and store the results in Redshift. If your files are character-delimited, you
can easily create a table definition with Hive/Spark and query them as if they
were an RDBMS. You can process your files in EMR using spot instances, and it
can be as cheap as less than a dollar per hour.

------
mindcrash
You might want to read this (for free):
[http://book.mixu.net/distsys/single-page.html](http://book.mixu.net/distsys/single-page.html)

And pay a little to read this book:
[http://www.amazon.com/Designing-Data-Intensive-Applications-...](http://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)

And this one:
[http://www.amazon.com/Big-Data-Principles-practices-scalable...](http://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343)

Nathan Marz brought Apache Storm to the world, and Martin Kleppmann is pretty
well known for his work on Kafka.

Both are very good books on building scalable data processing systems.

------
alexanderdaw
1. Stream your data into Kafka using flat JSON objects.

2. Consume your Kafka feeds using a Camus MapReduce job (a library from
LinkedIn that will output HDFS directories with the data).

3. Transform the HDFS directories into usable folders for each vertical
you're interested in; think of each output directory as an individual table
or database.

4. Use Hive to create an "external table" that references the transformed
directories. Ideally your transformation job will create merge-able hourly
partition directories. Importantly, you will want to use the JSON SerDe for
your Hive configuration.

5. Generate your reports using Hive queries.
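The transform in step 3, writing merge-able hourly partition directories that Hive can later mount as an external table, could be sketched like this (Python; the `dt=`/`hr=` layout and the `ts` field are assumptions for illustration):

```python
import json
import os

def partition_events(lines, root):
    """Write JSON event lines into dt=YYYY-MM-DD/hr=HH directories."""
    handles = {}
    for line in lines:
        event = json.loads(line)
        # assume each event carries an ISO timestamp like "2016-02-24T13:05:00"
        dt, tm = event["ts"].split("T")
        part_dir = os.path.join(root, f"dt={dt}", f"hr={tm[:2]}")
        if part_dir not in handles:
            os.makedirs(part_dir, exist_ok=True)
            handles[part_dir] = open(os.path.join(part_dir, "part-0000.json"), "a")
        handles[part_dir].write(line.rstrip("\n") + "\n")
    for f in handles.values():
        f.close()
```

Each `dt=`/`hr=` directory then maps one-to-one onto a Hive partition, so adding an hour of data is just adding a directory.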

This architecture will get you to massive, massive scale and is pretty
resilient to spikes in traffic because of the Kafka buffer. I would avoid
Mongo/MySQL like the plague in this case. A lot of designs focus on the real-
time aspect for data like this, but if you take a hard look at what you
really need, it's batch MapReduce at massive scale on a dependable schedule
with linear growth metrics. With an architecture like this deployed to AWS EMR
(or even Kinesis/S3/EMR) you could grow for years. Forget about the trendy
systems, and go for the dependable tool chains for big data.

------
lafay
We faced a very similar problem when we started Kentik two years ago, except
in our case the "transactions" are network traffic telemetry that we collect
from our customers' physical network infrastructure, and providing super-fast
ad hoc queries over that data is our core service offering.

We looked at just about every open source and commercial platform that we
might use as a backend, and decided that none were appropriate, for scale,
maturity, or fairness / scheduling. So we ended up building, from scratch,
something that looks a bit like Google's Dremel / BigQuery, but runs on our
own bare metal infrastructure. And then we put postgres on top of that using
Foreign Data Wrappers so we could write standard SQL queries against it.

Some blog posts about the nuts and bolts you might find interesting:

[https://www.kentik.com/postgresql-foreign-data-wrappers/](https://www.kentik.com/postgresql-foreign-data-wrappers/)

[https://www.kentik.com/metrics-for-microservices/](https://www.kentik.com/metrics-for-microservices/)

If we were starting today, we might consider Apache Drill, although I haven't
looked at the maturity and stability of that project recently.

------
asolove
Read "Designing data intensive applications"
([http://dataintensive.net/](http://dataintensive.net/)), which is an
excellent introduction to various techniques for solving data problems. It
won't specifically tell you what to do, but will quickly acclimate you to
available approaches and how to think about their trade offs.

~~~
rantanplan
Although it's an awesome book (I've bought it), it may be useful to inform any
interested buyers that not all chapters are ready yet, i.e. it hasn't been
released officially and doesn't have a final release date.

Right now the first 9 (out of 12) chapters are finished, and they're really
insightful.

------
jamiequint
Consider using CitusData to scale out Postgres horizontally. You can shard by
time and get basically linear speedup based on the number of shards. It's
extremely fast, and will be open source in early Q2 I think. You can then put
your Postgres instances on boxes with SSDs instead of paying for provisioned
IOPS. Writes also scale mostly linearly.

------
pklausler
50GiB/day is less than a megabyte per second. Surely you wouldn't be
bandwidth-limited on a real device, even consumer SSDs are in the 100-600
MiB/s range IIRC. Can you do anything to increase your bytes per IOP in your
current environment if you're IOP-limited?
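The arithmetic checks out:

```python
GiB = 1024 ** 3
per_second = 50 * GiB / 86_400                    # 50 GiB spread over a day
print(round(per_second / 1024 ** 2, 2), "MiB/s")  # ~0.59 MiB/s sustained
# even the 200 GiB/day peak is only ~2.4 MiB/s, far below any SSD's bandwidth
print(round(200 * GiB / 86_400 / 1024 ** 2, 2), "MiB/s")
```

So the bottleneck is the number of small operations (IOPS), not bytes per second, which is why batching writes into larger chunks helps so much.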

------
exacube
If your data is growing at this rate (and you plan to keep the data around),
you'd want a _distributed database_ that can scale to terabytes. But it might
be overkill if you don't care about data consistency (i.e., you don't need to
read data "right away" after you write it):

If you just want reports (and are okay getting them in a matter of minutes),
then you can continue storing them in flat files and using Apache Hive/Pig-
equivalent software (or whatever equivalent is hot right now; I'm out of date
on this class of software).

If you want a really good out-of-the-box solution for storage + data
processing, Google Cloud products might be a really good bet.

------
agnivade
Lots of good suggestions here. I won't say anything new, but I just wanted to
stress the data ingestion part.

DO NOT write to txt files and read them again. This is unnecessary disk IO, and
you will run into a lot of problems later on. Instead, have an agent which
writes into Kafka (as everyone mentioned), preferably using protobuf.

Then have an aggregator which does the data extraction and analysis and puts
the results in some sort of storage. You can browse this thread to decide what
sort of storage is suitable for you.

------
ermack
It's difficult to give an answer without understanding the data processing you
want.

If you need to generate rich multi-dimensional reports, I recommend you create
an ETL pipeline into a star-schema sharded database (a la OLAP).

Dimension normalization can sometimes dramatically reduce data volume; most
dimensions can even fit into RAM.

Actually, 200GB per day is not that much in terms of throughput; you can
manage it pretty well on a PostgreSQL cluster (with help of pg_proxy). I think
MySQL will also work OK.

Dedicated hardware will be cheaper than AWS RDS.

------
matt_s
Hire a DBA and stick with a relational DB. We had (previous group, still
running) a 40-50TB Oracle DB that collected much more data daily.

You shouldn't be writing to a txt file and then importing into a DB; that hits
disk multiple times and is unnecessary. Relational DBs should handle 50GB of
inserts in real time. If you stream the data into the DB in real time, you
won't need to worry about your 50GB batch taking longer than the window you
have.

Other things I have observed from very large relational DBs: have one database
for data capture and keep everything normalized. Your problem down the line is
going to be purging data, so if you can, partition tables by date so you can
just drop partitions later. Set up a separate database for reporting purposes
and import the necessary data (possibly consolidated metrics, etc.) into it
per your business needs. Make sure your DBA understands the business needs of
each, tunes the databases for data capture and reporting respectively, and is
involved in specifying table layouts, etc.

My rationale for sticking with RDBMS - you can solve this problem with that
tech, the candidate pool is larger and the depth of knowledge, documentation
and best practices on the technology is much deeper than NoSQL alternatives.

------
mslot
disclaimer: I work for Citus Data

The bottleneck is usually not I/O, but computing aggregates over data that
continuously gets updated. This is quite CPU intensive even for smaller data
sizes.

You might want to consider PostgreSQL, with Citus to shard tables and
parallelise queries across many PostgreSQL servers. There's another big
advertising platform that I helped move from MySQL to PostgreSQL+Citus
recently and they're pretty happy with it. They ingest several TB of data per
day and a dashboard runs group-by queries, with 99.5% of queries taking under
1 second. The data are also rolled up into daily aggregates inside the
database.

There are inherent limitations to any distributed database. That's why there
are so many. In Citus, not every SQL query works on distributed tables, but
since every server is PostgreSQL 9.5, you do have a lot of possibilities.

Looking at your username, are you based in the Netherlands by any chance? :)

Some pointers:

- How CloudFlare uses Citus: [https://blog.cloudflare.com/scaling-out-postgresql-for-cloud...](https://blog.cloudflare.com/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb/)

- Overview of Citus: [https://citus-conferences.s3.amazonaws.com/pgconf.ru-2016/Ci...](https://citus-conferences.s3.amazonaws.com/pgconf.ru-2016/CitusDB.pdf)

- Documentation: [https://www.citusdata.com/documentation/citusdb-documentatio...](https://www.citusdata.com/documentation/citusdb-documentation/)

------
foxbarrington
Here's what I've done for ~200GB/day. Let's pretend you have server logs with
URLs that tell you the referrer and whether the visit was an impression or a
conversion, and you want stats by "date", "referrer domain" and "action":

* Logs are written to S3 (either ELB does this automatically, or you put them there)

* S3 can put a message into an SQS queue when a log file is added

* A "worker" (written in the language of your choice, running on EC2 or Lambda) pops the message off the queue, downloads the log, and "reduces" it into grouped counts. In this case a large log file would be "reduced" to lines where each line is [date, referrer domain, action, count] (e.g. [['2016-02-24', 'news.ycombinator.com', 'impression', 500], ['2016-02-24', 'news.ycombinator.com', 'conversion', 20], ...])

* The reduction can either be persisted in a DB that can handle further analysis, or you reduce further first.
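The worker's "reduce" step might look like this in Python (the tab-separated log format is an assumption for illustration):

```python
from collections import Counter
from urllib.parse import urlparse

def reduce_log(lines):
    """Collapse raw log lines into [date, referrer domain, action, count] rows."""
    counts = Counter()
    for line in lines:
        # assume tab-separated fields: ISO timestamp, referrer URL, action
        ts, referrer, action = line.rstrip("\n").split("\t")
        date = ts[:10]
        domain = urlparse(referrer).netloc
        counts[(date, domain, action)] += 1
    return [[d, dom, act, n] for (d, dom, act), n in sorted(counts.items())]
```

A gigabyte of raw log collapses to a handful of rows per (date, domain, action) bucket, which is what makes the downstream DB insert cheap.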

------
stuartaxelowen
Check out LinkedIn's posts about log processing [0] and Apache Kafka. Handling
data as streams of events lets you avoid spiky query-based processing, and
helps you scale out horizontally. Partitioning lets you do joins, and you can
still add databases as "materialized views" for queryability. Add Secor to
automatically write logs to S3 so you can feel secure in the face of data
loss, and use a replication factor of at least 3 in your Kafka topics. Also,
use instrumentation from Datadog or New Relic from the start; it will show you
the performance bottlenecks.

0: [https://engineering.linkedin.com/distributed-systems/log-wha...](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)

------
bio4m
If you're on a tight budget and IO is your main bottleneck, it may be easier
to purchase a number of decent-spec desktop PCs with multiple SSDs in them.
SSDs have really come down in price while performance and capacity have
improved greatly; same goes for RAM. (The assumption here is that time is less
of a concern than cost at the moment, that you're not averse to doing some
devops work, and that the processing you're talking about is some sort of
batch processing and not realtime.) This way you can try a number of different
strategies without breaking the bank on AWS instances (and worst case you have
a spare workstation).

------
libx
I would consider Unicage for your demands.
[https://www.youtube.com/watch?v=h_C5GBblkH8](https://www.youtube.com/watch?v=h_C5GBblkH8)
[https://www.bsdcan.org/2013/schedule/attachments/244_Unicage...](https://www.bsdcan.org/2013/schedule/attachments/244_UnicageDevelopment.pdf)

In a shell (modified for speed and ease of use), get, insert and update data
in a simple way, without all the fat from other mainstream (Java) solutions.

------
batmansmk
We love those projects at my company (Inovia Team). Your load is not that big.
You won't make any big mistake stack-wise; just pick something you have
already operated in production at a smaller scale. MySQL, Postgres, MongoDB or
Redis will be totally fine. We have a training on how to insert 1M lines a
second with off-the-shelf free open-source tools (SQL and NoSQL); ping us if
you are interested in the slide deck.

Tip: focus on how to back up and restore first; the rest will be easy!

------
nickpeterson
Does the database grow 50GB or is that the size of the text files?

~~~
nickpeterson
As a DBA, I'm duty-bound to say this, though I'm sure it will be drowned out
by calls to 'Denormalize All the Things!'. DON'T denormalize for performance.
It is a siren's call. Don't take my word for it, see anything written by
Fabian Pascal, Hugh Darwen, Chris Date.

If I were faced with a dataset of 50GB in a text file being loaded into a
database every day, I would normalize as much as possible to ensure that I'm
not adding 50GB to the database every day. What's the plan, to have a 20TB
database in one year? No; instead I would focus on micro-optimizations at the
database level.

I would inspect the datatypes you're storing your data in. If a value has a
limited domain (an enum in programming parlance), determine the theoretical
maximum number of distinct values it could take, and use the equivalent
numeric value as a surrogate key into a separate table. If I know a value is
never going to exceed a few thousand distinct values, I can use a smallint,
which only uses 2 bytes, versus recording a char/varchar which is probably
going to burn at least 4-5 bytes every row, likely much more.
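The surrogate-key idea can be sketched as a small interning table (Python; illustrative only, the column names are made up):

```python
class SurrogateKeys:
    """Map a limited-domain string column to small integer ids (the lookup table)."""
    def __init__(self):
        self.by_value = {}   # value -> small int id
        self.by_id = []      # id -> value, for joins back out

    def intern(self, value):
        if value not in self.by_value:
            self.by_value[value] = len(self.by_id)
            self.by_id.append(value)
        return self.by_value[value]

states = SurrogateKeys()
# the fact row now carries a smallint-sized id instead of repeating the string
row = ("2016-02-24", states.intern("California"), 4)
```

In the database the `by_id` side is the small dimension table, and the 2-byte id is what actually lands on every one of the millions of fact rows.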

I believe the ordering of columns matters in Postgres, so consult the
documentation/Stack Overflow; swapping around column ordering can sometimes
save several bytes per row.

You'll notice quickly that you may have repeating 'groups' of similar values
across your surrogate keys. For instance, if you had a StateID and a CountryID
in the same table, chances are good the combinations of StateID and CountryID
are finite, and could themselves go into their own table, which has its own
surrogate key, so instead of storing two surrogates on every row, you could
use only one. (Incidentally, chances are even better that CountryID should
just be a reference on the State table itself, but this is just to illustrate
the point).

Now imagine you have sets of attributes that describe whole subsets of the
information in your tables that are mutually exclusive. You could make two
different tables with identical data structures; placing a row in one then
'means' those other implicit value combinations, so you record information
without actually storing it. The price is on retrieval: you have to union the
tables if you want to consider the whole set. That said, you may shave a large
amount of storage, and views could be created to make this transparent to end
users.

Don't use an auto-incrementing index on the table; instead store a value that
is incremented separately, and 'take' ranges of values at a time that get
cached by the routines that load the data.

Through this style of process, one can often turn a 1TB table into something a
fraction of that size, at the cost of much harder-to-write queries. That said,
there are other benefits: suddenly you can load values into the supporting
tables first, then cache those values and load the main dataset without doing
lookups into the related tables. The 'growth' of the database will slow
considerably. Then you just preprocess the files and 'COPY' them in, which
should let you load them in minutes, not hours.

------
ljw1001
If the big issue is querying the data, consider Redshift (expensive) or a
self-hosted column-store database. Data will need to be loaded in batches for
this. Column stores reduce IOPS through compression and selective data
loading, and because they don't have persistent indexes.

To save IOPS on the early part of the process, consider using fast compression
(lz4 or snappy) to compress the records before writing them to the file
system. This might cut your IOPS in half.
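A rough illustration of the saving (Python, using stdlib zlib at its fastest level as a stand-in for lz4/snappy, which aren't in the standard library):

```python
import json
import zlib

# repetitive event records, like real transaction logs
records = [{"id": i, "event": "impression"} for i in range(1000)]
raw = "\n".join(json.dumps(r) for r in records).encode()

packed = zlib.compress(raw, 1)  # level 1 trades ratio for speed, like lz4/snappy
print(len(packed), "of", len(raw), "bytes")  # repetitive JSON compresses very well
# fewer bytes written per batch -> fewer IO operations consumed on ingest
```

Transaction-style data is highly repetitive, so even the fastest compression levels typically shrink it by well over half before it ever touches the disk.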

------
pentium10
Use BigQuery. Here is a nice presentation on how to get going, plus some use
cases that will get you familiar with the territory. I offer consultation too,
so you can reach out.
[http://www.slideshare.net/martonkodok/complex-realtime-event...](http://www.slideshare.net/martonkodok/complex-realtime-event-analytics-using-bigquery-crunch-warmup)

------
i_don_t_know
I don't know what I'm talking about or what you need, but I hear kdb is
popular in the financial industry because supposedly it can handle large
amounts of real-time financial information. [http://kx.com](http://kx.com)

------
jacques_chester
Compare pricing on RDS, if doing it yourself is hurting.

AWS also has Kinesis, which is deliberately intended to be a sort of event
drain. Under the hood it uses S3 and they publish an API and an agent that
handles all the retry / checkpoint logic for you.

~~~
takno
Absolutely this. If you're in the AWS ecosystem, then move away from the bare-
ish metal and onto the tools they provide for doing this stuff. MongoDB in
particular really doesn't seem relevant to the problem at all. Throw all the
data into Kinesis, and then push the bits you need into Redshift at the other
end.

If you want to keep a lot of data for ad-hoc queries but only use a subset for
regular queries then consider setting up separate dense compute and dense
storage Redshift clusters as appropriate. As far as pricing goes there's a
hell of a discount for going reserved once you've got things set up the way
you want them.

Once you've got that running and the immediate fires are out, you may want to
look at Spark to pre-compute some of the reports (you can still push the
computed data into Redshift tables to save changing the reporting too much),
although I don't think Spark-as-a-service is really there in AWS right now,
and running up an EMR cluster is seriously going to cost you.

Bigtable provides similar capabilities to Redshift in the google ecosystem,
and looks okay although I wasn't blown away by it.

------
hoodoof
I'd start by asking if you are solving the right problem.

Does the business really need exactly this? What is their actual goal? Are
they aware of the effort and resources required to get this report?

------
coryrobinson42
I would highly recommend looking into Elasticsearch. Clustering and
scalability are its strong points and can help you with your quest.

------
ninjakeyboard
Look at your current solution and check the query plan of your SQL. If your
data is indexed correctly, it shouldn't be too bad to execute queries: 1M
records is about 20 ops to search for a record by key.
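That ~20-ops figure is just log2 of the row count, since a B-tree or binary-search lookup touches roughly log2(n) entries:

```python
import math

print(math.log2(1_000_000))      # ~19.9: about 20 comparisons per indexed lookup
# even a billion rows only raises that to ~30
print(math.log2(1_000_000_000))  # ~29.9
```

This is why a well-indexed table stays fast as it grows: each doubling of the data adds only one more comparison per lookup.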

If it's modelled in SQL, it's probably relational and normalized, so you'll be
joining tables together. This balloons the complexity of querying the data
pretty fast. Denormalizing data simplifies the problem, so see if you can get
it into a K/V store instead of a relational database. Not saying relational
isn't a fine solution; even if you keep it in MySQL, denormalizing will reduce
the complexity of querying it.

Once you determine if you can denormalize, you can look at sharding the data
so instead of having the data in one place, you have it in many places and the
key of the record determines where to store and retrieve the data. Now you
have the ability to scale your data horizontally across instances to divide
the problem's complexity by n where n is the number of nodes.
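A minimal sketch of that key-based routing (the shard names are made up): the hash of the record key alone decides which node stores and serves it.

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]

def shard_for(key: str) -> str:
    """Route a record to a shard purely from its key."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# The same key always lands on the same shard, so reads know where to look.
assert shard_for("txn-20160101-0001") == shard_for("txn-20160101-0001")
```

Note the weakness of plain `% len(SHARDS)`: adding a node remaps most keys. Consistent hashing (what Dynamo-style stores use) is the standard fix for that.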

Unfortunately the network is not reliable so you suddenly have to worry about
CAP theorem and what happens when nodes become unavailable so you'll start
looking at replication and consistency across nodes and need to figure out
with your problem domain what you can tolerate. E.g. bank accounts have
different consistency requirements than social media data, where a stale read
isn't a big deal.

Aphyr's Call Me Maybe series has reviews of many datastores under pathological
conditions, so read about your choice there before you go all in (assuming you
do want to look at different stores). Dynamo-style DBs like Riak are what I
think of immediately, but read around; this guy is a wizard.
[https://aphyr.com/tags/Jepsen](https://aphyr.com/tags/Jepsen)

AWS has a notorious network, so really think about those failure scenarios.
Yes, it's hard, and the network isn't reliable. Dynamo DBs are cool though and
fit the big problems you're looking at if you want to load and query the data.

If you want to work with the data, Apache Spark is worth looking at. You
mention MapReduce, for instance; Spark is quick.

It's sort of hard because there isn't a lot of information about the problem
domain so I can only shoot in the dark. If you have strong consistency needs
or need to worry more about concurrent state across data that's a different
problem than processing one record at a time without a need for consistent
view of the data as a whole. The latter you can just process the data via
workers.

But think sharding to divide the problem across nodes, and denormalization
(e.g. via key/value lookup) for simple runtime complexity. But start where you
are: look at your database and make sure it's very well tuned for the queries
you're making.

Do you even need to load it into a db? You could distribute the load across
clusters of workers if you have some source that you can stream the data from.
Then you don't have to load and then query the data. Depends heavily on your
domain problem. Good luck. I can email you to discuss if you want - I just
don't want to post my email here. Data isn't so much where I hang out as much
as processing lots of things concurrently in distributed systems is, so others
who have gone through similar items may have better ideas.
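A sketch of that "workers without a database" idea, processing batches of transaction lines in parallel and combining the partial results (the batch contents are invented; a real version would stream from the txt files or a queue):

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(lines):
    """Parse one batch of transaction lines; here we just count non-empty ones."""
    return sum(1 for line in lines if line.strip())

batches = [["tx1", "tx2"], ["tx3"], ["", "tx4"]]

# Each worker handles one batch independently; no shared state, no DB load step.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_batch, batches))

print(sum(partials))  # 4
```

This only works when each record can be processed without a consistent view of the whole dataset, which is exactly the distinction drawn above.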

There are some cool papers like the Amazon Dynamo paper and I read the Google
Spanner paper the other day (more globally oriented and around locking and
consistency items). You can see how some of the big companies are formalizing
thinking by reading some of the papers in that space. Then there are
implementations you can actually use but you need to understand them a bit
first I think.

[http://www.allthingsdistributed.com/files/amazon-dynamo-sosp...](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)

[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf)

------
faizshah
Note: This is based on solutions I have been researching for a current project
and I haven't used these in production.

Short answer: I think you're looking in the wrong direction; this problem
isn't solved by a database but by a full data processing system like Hadoop,
Spark, Flink (my pick), or Google Cloud's Dataflow. I don't know what kind of
stack you guys are using (IMO the solution to this problem is best built
leveraging Java), but I would say that you could benefit a lot from either the
Hadoop ecosystem or Google Cloud's ecosystem. Since you say that you are not
experienced with that volume of data, I recommend you go with Google Cloud's
ecosystem; specifically, look at Google Dataflow, which supports autoscaling.

Long answer: To answer your question more directly, you have a bunch of data
arriving that needs to be processed and stored every X minutes and needs to be
available to be interactively analyzed or processed later in a report. This is
a common task and is exactly why the hadoop ecosystem is so big right now.

The 'easy' way to solve this problem is by using google dataflow which is a
stream processing abstraction over the google cloud that will let you set your
X minute window (or more complex windowing) and automatically scale your
compute servers (and pay only for what you use, not what you reserve). For
interactive queries they offer google bigquery, a robust SQL based column
database that lets you query your data in seconds and only charges you based
on the columns you queried (if your data set is 1TB but the columns used in
your query are only some short strings they might only charge you for querying
5GB). As a replacement for your mysql problems they also offer managed mysql
instances and their own google bigtable which has many other useful features.
Did I mention these services are integrated into an interactive ipython
notebook style interface called Datalab and fully integrated with your
dataflow code?

This all might get a little expensive though (in terms of your cloud bill);
the other solution is to do some harder work involving the Hadoop ecosystem.
The problem of processing data every X minutes is called windowing in stream
processing. Your problems are solved by using Apache Flink, a relatively easy
and fast stream processing system that makes it easy to set up clusters as you
scale your data processing. Flink will help you with your report generation
and make it easy to handle processing this streaming data in a fast, robust,
and fault tolerant (that's a lot of buzz words) fashion.
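The windowing idea itself is simple. Here is a plain-Python sketch of fixed 5-minute windows over (timestamp, amount) events (the numbers are invented); this is the kind of grouping Dataflow or Flink manages for you at scale, incrementally and fault-tolerantly:

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # the "every X minutes" batch, here 5 minutes

def window_start(ts: int) -> int:
    """Align an event timestamp to the start of its fixed window."""
    return ts - (ts % WINDOW_SECONDS)

def aggregate(events):
    """Sum amounts per window; a streaming engine does this incrementally."""
    totals = defaultdict(float)
    for ts, amount in events:
        totals[window_start(ts)] += amount
    return dict(totals)

events = [(10, 1.0), (120, 2.0), (301, 0.5)]
print(aggregate(events))  # {0: 3.0, 300: 0.5}
```

What the real engines add on top of this one-liner is the hard part: late data, event time vs processing time, and recovering window state after a node dies.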

Please take a look at the Flink programming guide or the data-artisans
training sessions on this topic. Note that the problem of doing SQL queries
with Flink is not solved (yet); this feature is planned for release this year.
However, Flink will solve all your data processing problems in terms of the
cross-table reports and preprocessing for storage in a relational database or
distributed filesystem.

For storing this data and making it available you need something fast but just
as robust as MySQL. The 'correct' solution at this time, if you are not using
all the columns of your table, is a columnar one. From Google's cloud you have
BigQuery; from the open source ecosystem you have Drill, Kudu, Parquet,
Impala, and many many more. You can also try postgres or RethinkDB for a full
relational solution, or HDFS/QFS + Ignite + Flink from the Hadoop ecosystem.
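The column-store point can be shown in miniature (the rows are fabricated): with a columnar layout, a query over one numeric column never has to touch the wide string columns at all, which is why BigQuery can charge only for the columns a query scans.

```python
# Row layout: every scan drags whole records through memory.
rows = [
    {"ts": 1, "campaign": "a", "amount": 0.50, "user_agent": "a very long string ..."},
    {"ts": 2, "campaign": "a", "amount": 0.25, "user_agent": "another long string ..."},
]
row_total = sum(r["amount"] for r in rows)

# Column layout: the same data stored column by column; summing
# `amount` reads one small list and ignores `user_agent` entirely.
columns = {
    "ts": [1, 2],
    "campaign": ["a", "a"],
    "amount": [0.50, 0.25],
    "user_agent": ["a very long string ...", "another long string ..."],
}
col_total = sum(columns["amount"])

assert row_total == col_total == 0.75
```

Parquet, Kudu, and friends are essentially this layout plus compression and encoding tricks that work much better on a homogeneous column than on mixed rows.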

For the problem of interactively working with your data, try Apache Zeppelin
(free, Scala required I think) or Databricks (paid but with lots of features,
Spark only I think). Or take the results of your query from Flink or similar
and interactively analyze those using Jupyter/IPython (the solution I use).

The short answer is: dust off your old Java textbooks. If you don't have a
Java dev on your team and aren't planning on hiring one, the Google Dataflow
solution is way easier and cheaper in terms of engineering. If it helps, I do
need an internship ;)

If you want to look at all the possible solutions from the hadoop ecosystem
look at:
[https://hadoopecosystemtable.github.io/](https://hadoopecosystemtable.github.io/)

For google cloud ecosystem it's all there on their website.

Happy coding!

Oops, it seems I left out ingestion; I would use Kafka or Spring Reactor.

P.S. The Flink mailing list is very friendly; try asking this question there.

