
Moving product recommendations from Hadoop to Redshift saves us time and money - jpat
http://engineering.monetate.com/2014/06/18/moving-product-recommendations-from-hadoop-to-redshift-saves-us-time-and-money/
======
monstrado
These types of articles baffle me; you're comparing a high-performance
analytical database to a batch-oriented SQL engine.

The whole point behind these query engines on Hadoop (Hive, Presto, Impala,
etc) is to separate the database from the query engine. With these engines you
can project schemas over raw data in its original form, without having to load
it into a table. With Redshift, or other similar analytical databases, you're
forced to define a schema, and then load the data in row by row...bulk inserts
are very slow in comparison to Hadoop technologies.
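The schema-on-read idea can be sketched in miniature outside of Hadoop. This toy Python snippet (invented field names and data) projects a schema over raw delimited text at query time, which is roughly what Hive's external tables do; nothing is ever loaded into a table:

```python
import csv
import io

# Raw data stays in its original form -- no load step, no table.
RAW_LOG = """\
2014-06-18,widget-a,2
2014-06-18,widget-b,1
2014-06-19,widget-a,5
"""

# The "schema" is just a projection applied at read time.
SCHEMA = ("date", "product_id", "quantity")

def scan(raw, schema):
    """Yield dict rows by projecting the schema over raw text at query time."""
    for fields in csv.reader(io.StringIO(raw)):
        row = dict(zip(schema, fields))
        row["quantity"] = int(row["quantity"])
        yield row

# "Query" the raw data: total quantity per product.
totals = {}
for row in scan(RAW_LOG, SCHEMA):
    totals[row["product_id"]] = totals.get(row["product_id"], 0) + row["quantity"]

print(totals)  # {'widget-a': 7, 'widget-b': 1}
```

With Redshift, by contrast, the equivalent of `SCHEMA` has to be declared up front as a table definition before any row can be queried.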

Regardless, Hive in general should never be used for interactive analytics;
that's not what it's intended for. Where Hive shines is when you can dump
250TB of raw text data into a folder and then run a SQL query to extract
useful information out of it. The extracted data could then be loaded into an
RDBMS like Redshift for real-time reporting.

With all that being said, if you want to run SQL queries on data in Hadoop at
the speeds of Redshift, you should have used Impala with Parquet, which is
known to be even faster than Redshift in many cases, and is based on the same
technology Google uses (Dremel and F1). The benefits of keeping your data in
Hadoop are enormous, not every problem can be solved using SQL. The same data
you're querying with Impala could actually be used to do machine learning
using Spark or Mahout. Maybe you want to start indexing one of your tables
into Solr to provide search capabilities on a subset of your columns to your
users...or maybe you want to use Giraph or Spark's GraphX to do parallel graph
computation. The data never moves, there's still only ONE copy of that data in
Hadoop, and you can bring any kind of workload to it.

~~~
bbrunner
Redshift is an especially limited SQL engine considering it doesn't support
UDFs. It is wicked fast, but what you get in speed you lose in flexibility.
Current (well, February, but fairly current) benchmarks[0] place Impala and
Shark (SQL on top of Spark) within reach of Redshift while pulling data from
disk and, for certain workloads, on par with or faster than Redshift. This is
without using a columnar file format.

Impala is impressive technology, but it does require you to run dedicated
Impala daemons, as it doesn't use MapReduce under the hood. Shark is
especially interesting, however, because it is fast AND built on top of Spark,
so you can run raw Spark jobs, SQL queries, graph processing, and ML all on the
same cluster. Shark currently uses Hive to generate its query plans, but the
Spark project is working on implementing its own SQL engine called
Catalyst[1] that promises to be a significant improvement.

[0] [https://amplab.cs.berkeley.edu/benchmark/](https://amplab.cs.berkeley.edu/benchmark/)

[1] [https://spark-summit.org/talk/armbrust-catalyst-a-query-optimization-framework-for-spark-and-shark/](https://spark-summit.org/talk/armbrust-catalyst-a-query-optimization-framework-for-spark-and-shark/)

~~~
monstrado
Although I have a lot of respect for the AMPLab, they did not do their due
diligence with that benchmark, for a few reasons: they didn't test using
columnar storage in Hadoop (ORC/Parquet), which is what Redshift is using
underneath (a proprietary columnar store); the most complicated query they
ran was a two-table join; and from what I can tell, there wasn't any
concurrent workload testing.

(Disclaimer: I'm a Cloudera employee.)

I recommend checking out the following blog post, not because my employer
wrote it, but because the people behind the benchmark did an incredible job
making it competitive. They also show metrics that a lot of other benchmarks
leave out, such as concurrent workload capabilities and CPU efficiency.

It compares Impala, Hive (on Tez), Shark, and Presto:

[http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/](http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/)

~~~
weliam
Impala does not currently support SerDes, last I checked, which limits its
usage for certain cases. And I wouldn't treat any benchmark too seriously,
since every vendor probably only knows how (or is only willing) to tune its
own products. Check the latest Spark SQL benchmark:
[http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html](http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html)

~~~
monstrado
You pay a significant resource penalty when using SerDes, and since
performance is one of the biggest priorities for the Impala team, we decided
to leave this out for now. A very common workaround is to use Hive to generate
Parquet data from your custom data (using SerDes), and then use Impala to
query the Parquet data.

I disagree with your statement about not treating benchmarks from vendors
seriously. As the article mentions, we made an effort to make these queries
run as efficiently as possible, even going so far as to rewrite queries on
competing engines to make them run faster. In fact, Databricks' engineers
assisted us in making the Shark benchmarks as good as they could possibly be.
The benchmark I linked is very thorough, and even supplies the exact
queries/scripts we used to perform the tests so you can run them yourself.

------
rpedela
I chuckled when I read "We have a legacy data warehouse based in Hive and
Elastic MapReduce, with backing data stores in S3.". I guess things have come
full circle. It wasn't long ago that a relational database solution would have
been "legacy".

~~~
hyperliner
But "legacy" applies not only when technology gets old, but also when
solutions get old. Maybe it was just a bad solution, and they are moving to a
new solution, not necessarily a new platform.

------
mattj
I've gone through a similar transition (Hive to Redshift) in a very large-scale
data environment. Raw Hadoop/Cascading is still very useful for more
complicated workflows, but Redshift is so vastly superior to Hive it's not
even funny. I thought I would miss adding my own UDFs, but this hasn't been an
issue at all. I'm under the impression Presto is a similar improvement, but I
haven't spent any time with it.

One huge advantage of redshift over hive: you can connect with plain old
Postgres libraries, so you can build redshift results into your admin
interfaces, one off scripts, and anywhere else you're fine trading a few
seconds of latency for extra data.
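That "plain old Postgres libraries" point can be sketched with psycopg2, the stock third-party Postgres driver for Python; the cluster host, credentials, and table/column names below are all placeholders, not Monetate's actual schema:

```python
# Redshift speaks the Postgres wire protocol, so a plain Postgres driver
# works; nothing Redshift-specific is required on the client side.
# Host, credentials, and table/column names are placeholders.

TOP_PRODUCTS_SQL = """
    SELECT product_id, COUNT(*) AS purchases
    FROM purchase_lines
    WHERE purchased_at >= %s
    GROUP BY product_id
    ORDER BY purchases DESC
    LIMIT %s;
"""

def top_products(conn, since, limit=10):
    """Run the report over an already-open Postgres/Redshift connection."""
    with conn.cursor() as cur:
        cur.execute(TOP_PRODUCTS_SQL, (since, limit))
        return cur.fetchall()

def main():
    # Run against a real cluster: pip install psycopg2-binary.
    import psycopg2
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="reporter", password="...",
    )
    for product_id, purchases in top_products(conn, "2014-06-01"):
        print(product_id, purchases)
```

The same `top_products` call works from an admin interface, a cron job, or a one-off script, which is exactly the convenience being described.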

~~~
endersshadow
Just as a quick note: you can use Postgres libraries because Redshift is a
slightly modified version of Postgres (8.0.2) under the covers. In fact, many
massively-parallel-processing (MPP) databases are Postgres under the covers.
It really speaks to how impressive Postgres is at scaling, even in old
releases!

~~~
mattj
Yup! My experience with Redshift has actually made me curious to try out
Postgres (I've always used MySQL before this). The stricter SQL dialect was a
little odd at first, but I think I've become more comfortable with it over the
past few months.

~~~
noir_lord
Do it. I switched to Postgres for pretty much everything last year and I
absolutely adore it: it just works, no stupid edge cases, excellent
documentation, nice tooling (pgAdmin III is better than commercial products
I've seen costing hundreds if not thousands).

I'm barely scratching the surface of what it is capable of yet as well (mostly
because I'm in ORM land most of the time).

------
ergest
Is it just me, or are people switching to non-relational data warehouse
architectures simply because it's en vogue? How many companies do you know
that have enough data for a non-relational DW to actually make sense? I
wonder: have we really pushed relational databases to their breaking point?

~~~
meritt
I've looked at, and avoided doing anything serious with, HDFS/MapReduce for 6
years now. I'm glad some people are starting to realize that re-processing
your entire dataset every single time you want to do something isn't very
efficient. I'm still waiting for the lightbulb moment where the usefulness of
it really makes sense to me.

Can anyone point me to a book or blog that discusses _good_ uses of
hadoop/map-reduce?

~~~
jgrahamc
I'm waiting for the day people realize that materialized views in databases
are awesome and decide to incorporate them into a framework.

~~~
meritt
At least if you're using Oracle they are, as it supports auto-refreshing.
Postgres has only had them since 9.3 (and they have to be manually refreshed).
Meanwhile, MySQL is still struggling with regular views.
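What Postgres 9.3 spells `CREATE MATERIALIZED VIEW` / `REFRESH MATERIALIZED VIEW` can be emulated by hand in any engine as a snapshot table plus a refresh step, which is essentially what the frameworks above keep reimplementing. A toy sqlite3 sketch (table and column names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("a", 10), ("b", 5), ("a", 3)])

def refresh_sales_totals(conn):
    """Store the query result as a real table -- a hand-rolled materialized
    view. In Postgres 9.3+ this is CREATE/REFRESH MATERIALIZED VIEW."""
    conn.execute("DROP TABLE IF EXISTS sales_totals")
    conn.execute("""
        CREATE TABLE sales_totals AS
        SELECT product, SUM(amount) AS total
        FROM sales GROUP BY product
    """)

refresh_sales_totals(conn)
print(conn.execute("SELECT * FROM sales_totals ORDER BY product").fetchall())
# [('a', 13), ('b', 5)]

# The snapshot goes stale until refreshed -- the manual step Postgres
# exposes as REFRESH MATERIALIZED VIEW.
conn.execute("INSERT INTO sales VALUES ('b', 20)")
refresh_sales_totals(conn)
print(conn.execute("SELECT * FROM sales_totals ORDER BY product").fetchall())
# [('a', 13), ('b', 25)]
```

Reads against the snapshot stay fast regardless of how expensive the underlying query is; the cost is moved into the explicit refresh.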

------
alanctgardner3
Another commenter pointed this out, but what you're trying to compute is
cosine similarity, in which case you're missing the normalizing part in the
denominator (the product of the magnitudes of both vectors). In other words,
two items which both occur frequently will score higher than two items which
occur infrequently but co-occur more often than usual. This leads to a
tendency to over-recommend popular items.
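A toy Python sketch of the point (the item vectors are invented, not the article's data): a raw dot product used as a co-occurrence score ranks the popular pair far higher, while dividing by the vector magnitudes puts the niche, always-bought-together pair on top.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product normalized by the magnitudes of both vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Rows: how often each item was bought in each of five sessions.
popular_a = [9, 8, 9, 7, 8]   # bought constantly, with everything
popular_b = [8, 9, 7, 8, 9]
niche_a   = [0, 2, 0, 0, 2]   # rare, but always bought together
niche_b   = [0, 2, 0, 0, 2]

# Raw co-occurrence (dot product) favors sheer popularity...
print(dot(popular_a, popular_b))   # 335
print(dot(niche_a, niche_b))       # 8

# ...while cosine similarity ranks the perfectly co-occurring niche pair
# at the top, independent of overall volume.
print(round(cosine_similarity(popular_a, popular_b), 3))  # 0.988
print(round(cosine_similarity(niche_a, niche_b), 3))      # 1.0
```

The denominator is exactly the "product of the magnitudes" being referred to; dropping it is what biases the recommender toward popular items.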

When you were on EMR, you could have used Mahout's distributed collaborative
filtering, which has the benefits of being correct and requiring zero coding.

Wikipedia explains here:
[http://en.wikipedia.org/wiki/Cosine_similarity](http://en.wikipedia.org/wiki/Cosine_similarity)

~~~
jpat
Thanks for the tip, I'll look into this more.

------
alaiacano
You should use something like tf-idf to normalize your co-occurrence counts
for your recommender; otherwise you'll just end up recommending the most
globally popular products.
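One way to read that suggestion (a sketch, not the article's actual pipeline): damp each candidate product's raw co-occurrence count by an IDF-style weight, treating orders as documents, so globally popular products stop dominating. The counts and product names below are invented.

```python
import math

# Toy data: for one target product, how often each candidate co-occurred
# with it, and how many of N total orders each candidate appears in
# overall (its global popularity).
N_ORDERS = 10_000
cooccur = {"bestseller": 500, "niche-accessory": 120}
order_freq = {"bestseller": 9_000, "niche-accessory": 150}

def idf(product):
    """Inverse document frequency, treating orders as documents."""
    return math.log(N_ORDERS / order_freq[product])

# Raw counts would recommend the bestseller (500 > 120);
# IDF weighting flips the ranking toward the genuinely related item.
scores = {p: c * idf(p) for p, c in cooccur.items()}
for product, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(product, round(score, 1))
```

With these numbers, the niche accessory scores roughly an order of magnitude above the bestseller, even though its raw co-occurrence count is smaller.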

------
lsb
It's also unclear how many rows they're trying to do this on, and at what
frequency; that's the crux of what turns this from a small-to-medium-data
problem, which you can easily solve on a large box with 10 lines of code, into
a big data problem, which requires completely different tooling.

~~~
jpat
In my testing of this query, I ran it against a time range that included over
40 million purchase lines, and our configuration of Redshift returned the
result in ~6 minutes. That was much quicker than our legacy EMR
implementation.

Currently, we update our product recommendations nightly. However, the
speedup we see from this reimplementation may allow us to update product
recommendations more frequently.

------
callesgg
I generally love SQL and use it in more or less all my stuff.

The one thing I fucking hate is how hard it is to get the database to execute
queries in an efficient way on anything that is more than a simple join and
select.

------
zatkin
Wasn't Hadoop the first of its kind in Big Data?

~~~
crb
Hadoop was the first (major) open-source implementation of Google's MapReduce
framework.
[http://research.google.com/archive/mapreduce.html](http://research.google.com/archive/mapreduce.html)

In terms of data warehousing and near-real-time query over Big Data, Google's
framework for that is called "Dremel",
[http://research.google.com/pubs/pub36632.html](http://research.google.com/pubs/pub36632.html)

Google offers Dremel as a service known as BigQuery.

------
NorthernDaemon
Those who want on-premise Redshift should try Actian Matrix. Redshift is just
Amazon's version of Actian Matrix (formerly ParAccel).

------
shamney
Aside from the example in the original article, what sort of questions are
these tools used to answer?

