
Ask HN: How can I spend $1000 of AWS credit on learning about HDFS performance? - jvns
I want to spend a week better understanding performance (probably of things in the Hadoop ecosystem) on a few different dataset sizes (8GB, 100GB, 1TB). I have $1000 of AWS credit that I can spend on this (yay!)

Some things I want:

* get a much better grasp on the performance of in-memory operations (put 8GB of data into memory and be done) vs running a distributed map reduce.

* Understand what goes into the performance (how much time is spent copying data? sending data over the network? CPU?)

* Learn something about tradeoffs

I'd love suggestions for experiments to run and setups to use. At work I've been using HDFS / Impala / Scalding, so my current thought is to spend time looking in depth at running a map/reduce with Scalding vs an Impala query vs running a non-distributed job in memory, because I already know about those things. But I'm open to other ideas!

Some questions I need to answer:

* Are there good large open datasets I could use? I'd like to use real data because it's more fun.

* If you were going to try to make reproducible experiments, where would you start?

* How can I set up an environment without spending an entire week on it?

* How can I make installing everything as easy as possible?

* I have $1000 of AWS credit to spend. How much should I budget? What machines should I spend it on?
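One way to frame the budget question above is in instance-hours: fix a cluster size and hourly rate, and see how many hours of experiments $1000 buys. A minimal sketch — the rates below are illustrative placeholders, not real AWS prices (check the EC2 pricing page for current numbers):

```python
# Rough budget arithmetic: how many wall-clock hours can a cluster run
# on a fixed credit? Rates are hypothetical, for illustration only.
CREDIT = 1000.0

illustrative_hourly_rates = {
    "cheap-node": 0.10,   # hypothetical small instance, $/hour
    "big-memory": 2.00,   # hypothetical high-memory instance, $/hour
}

def affordable_hours(credit, rate_per_hour, cluster_size=1):
    """Total hours a cluster of `cluster_size` nodes can run on `credit`."""
    return credit / (rate_per_hour * cluster_size)

for name, rate in sorted(illustrative_hourly_rates.items()):
    hours = affordable_hours(CREDIT, rate, cluster_size=5)
    print("%s x5: %.1f hours" % (name, hours))
```

Even at the high-memory rate, a 5-node cluster runs for about 100 hours, which suggests the week of experiments is comfortably affordable as long as instances get shut down between runs.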
======
bra-ket
try the YCSB benchmark:
[https://github.com/brianfrankcooper/YCSB](https://github.com/brianfrankcooper/YCSB)

dfsio: [https://support.gopivotal.com/hc/en-us/articles/200864057-Ru...](https://support.gopivotal.com/hc/en-us/articles/200864057-Running-DFSIO-mapreduce-benchmark-test)

note: EBS I/O performance used to be abysmal, especially on cheaper instances. Prefer the instance (ephemeral) storage that comes native with the EC2 instance over network-attached EBS; see this ebook on common EC2 performance problems: [http://www.datadoghq.com/wp-content/uploads/2013/07/top_5_aw...](http://www.datadoghq.com/wp-content/uploads/2013/07/top_5_aws_ec2_performance_problems_ebook.pdf)

if you're interested in scalable in-memory computing (online vs batch) try

Storm:
[http://storm.incubator.apache.org/](http://storm.incubator.apache.org/)

Spark: [http://spark.apache.org/](http://spark.apache.org/)

Phoenix:
[http://phoenix.incubator.apache.org/](http://phoenix.incubator.apache.org/)

------
willejs
* How can I set up an environment without spending an entire week on it? CloudFormation would be a good bet here.

* How can I make installing everything as easy as possible? Use hosted Chef and community cookbooks.

These suggestions might be pretty daunting if you haven't used Chef or autoscaling/CloudFormation before. Alternatively, you could hack it all together once, bake an AMI, and clone that.

Look at using something with finer-grained metrics than CloudWatch (its resolution is 5 minutes), such as Graphite with collectd, to collect stats easily.
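Graphite is also easy to feed directly from your own harness: its plaintext protocol is just `metric value timestamp` lines over TCP (port 2003 by default). A minimal sketch — the hostname is a placeholder for wherever your carbon/Graphite box lives:

```python
# Push a single datapoint to Graphite's plaintext listener (default TCP
# port 2003). Host name below is an assumption; substitute your own.
import socket
import time

def graphite_line(metric, value, timestamp=None):
    """Format one datapoint in Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (metric, value, timestamp)

def send_metric(metric, value, host="graphite.example.internal", port=2003):
    line = graphite_line(metric, value)
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

# Example: record a job's wall-clock time after each run
# send_metric("experiments.scalding.wordcount.seconds", 312.4)
```

Tagging each run's metrics with the dataset size and framework in the metric path makes the later comparison graphs almost free.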

