
Goodreads offloads DynamoDB tables to S3 and queries them with Athena - mooreds
https://aws.amazon.com/blogs/big-data/how-goodreads-offloads-amazon-dynamodb-tables-to-amazon-s3-and-queries-them-using-amazon-athena/
======
kaycebasques
I use Goodreads as my main book tracker and reviewer. My impression is that
Amazon bought it for whatever reason, and then ignored it. It seems like
there's so much more potential for a social community around books than what
Goodreads offers. The UI also never fails to disappoint. One of the core user
features, searching for books, has all sorts of weird quirks. I'll use a very
specific query that should guarantee a hit, and nothing comes up. Or, the book
that I want will come up for a second, and right as I'm going to select it,
the results completely change and the book I want has vanished.

~~~
a_bonobo
The website search doesn't seem to be tokenized; it seems to match on the
entire string only, so if there's a slight difference it won't find anything.
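
A toy illustration of the difference (made-up titles, nothing to do with
Goodreads' actual code):

```python
# Exact whole-string matching misses near-hits that even naive
# whitespace tokenization would catch.
def exact_search(query, titles):
    return [t for t in titles if query.lower() == t.lower()]

def token_search(query, titles):
    q_tokens = set(query.lower().split())
    return [t for t in titles if q_tokens <= set(t.lower().split())]

titles = ["The Left Hand of Darkness"]
print(exact_search("left hand of darkness", titles))  # [] -- no hit
print(token_search("left hand of darkness", titles))  # finds the book
```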

The main thing I've noticed in Goodreads since they were bought is a strong
focus on Kindle integration. The newer Kindles all have Goodreads integration
and you can send notes etc. to your Goodreads profile, but the site itself has
changed very little.

(although, around the time the Amazon purchase went through, Goodreads changed
a lot of rules and started deleting user reviews that focus too much on author
behaviour:
[https://www.washingtonpost.com/blogs/compost/wp/2013/09/23/i...](https://www.washingtonpost.com/blogs/compost/wp/2013/09/23/is-good-reads-new-policy-really-censorship/?noredirect=on&utm_term=.356b12aa12ab))

~~~
psychometry
Feature-rich text search with Elasticsearch is stupidly easy. It's a real
shame.

~~~
byrnehollander
I haven’t found this to be exactly the case - I recently worked with an
engineer for more than a month tuning Elasticsearch (which required non-
trivial changes to our catalog service). (It also took a good amount of effort
to coordinate these changes across devices.)

Like Goodreads, we have a fairly constrained catalog and we probably get a
similar amount of queries.

The search results are now much, much better, but ES has a pretty poor edit
distance / fuzziness algorithm so they still aren’t perfect.

~~~
saganus
I know the tuning you did is probably very context-dependent, but could you
elaborate a bit on what kinds of things you tuned?

I am about to start using ES for a project and just knowing which kinds of
things could be useful to tune would be helpful.

Thanks!

~~~
byrnehollander
Sure thing!

I'll see if we can make a blog post, but here are a bunch of things that jump
out (I work at a company that sells tickets to live events, so our users
search for 'live events' like sporting events / concerts and 'performers' like
teams and musicians):

\- word order matters _a lot_. we did a lot of fiddling with the n-gram
tokenizer
([https://www.elastic.co/guide/en/elasticsearch/reference/curr...](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html)).
we ended up making word order matter a good amount (e.g., 'new york' vs 'york
new' return very different results... considering them the same resulted in a
lot of noise). there's a rough sketch of the analyzer setup after this list

\- where the user is searching from is pretty important -- we would fetch the
25 best results and then boost (i.e., reorder) them based on the user's
distance from the event venue or the sports team's home venue. we also
experimented with fetching more and more results (up to 250) and then boosting
from this larger result set. note that ES couldn't take location into account
out of the box -- we had to manually boost on the ES output

\- we set up versioning with our autocomplete endpoint so we could more easily
A/B test variants (highly recommend this)

\- we built a system so non-technical employees could create "synonyms." for
example, "nyc" could expand to "New York City." we also worked with our data
science team to get a list of bad queries that might need synonyms to improve
them. (we also automatically triggered a real-time re-index on synonym
creation)

\- we similarly had an "expectations" tool for bug reporting and finding
patterns from common bugs

\- we had to add a bunch of other metadata / suffixes to our documents. for
example, we might want to return a 1pm Yankees game on August 4 when someone
queries "august yankees afternoon game". so we have to interpret the time and
add the month to what's being queried. similarly, we want this event to return
when someone queries 'nyc baseball', so we need to ensure the league/sport is
associated with the event document

\- we also had to add "stop words" that we ignored when querying. these
include 'game(s)', 'versus', 'concert(s)', 'tickets', etc

\- we have an internal definition of performer or event "popularity", and
needed to normalize this so ES's "match score" made more sense. (we had
limited success here)

\- their documentation describes fuzziness as: `fuzziness is interpreted as a
Levenshtein Edit Distance — the number of one character changes that need to
be made to one string to make it the same as another string` which is overly
simplistic and really messy to override (we decided against it)

\- because we had two different entities in our results ('events' and
'performers'), we had to figure out how to compare different entities (it was
generally easier to compare results within entities) based on what was
returned, time to event, location of event, and home location of the
performer. we also added additional entities / pages on an ad-hoc basis which
further complicated things

\- we also needed to exclude low quality performers and events from our
catalog (e.g., performers with no events, events with no tickets for sale)
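
To make the n-gram / synonym / stop-word / boosting items concrete, here's
roughly the shape of it (I didn't write our actual config, so treat every name
below as hypothetical; it's an illustration, not what we shipped):

```python
# A hypothetical Elasticsearch index body: edge n-grams at index time give
# prefix matching, while synonyms and stop words are applied only to the
# query text. Not a real production config.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "autocomplete_ngram": {
                    "type": "edge_ngram",   # so 'yank' can match 'yankees'
                    "min_gram": 2,
                    "max_gram": 15,
                    "token_chars": ["letter", "digit"],
                }
            },
            "filter": {
                "query_synonyms": {         # e.g. expand 'nyc'
                    "type": "synonym",
                    "synonyms": ["nyc => new york city"],
                },
                "query_stopwords": {        # ignore filler words in queries
                    "type": "stop",
                    "stopwords": ["game", "games", "versus",
                                  "concert", "concerts", "tickets"],
                },
            },
            "analyzer": {
                "autocomplete_index": {     # applied when documents are indexed
                    "tokenizer": "autocomplete_ngram",
                    "filter": ["lowercase"],
                },
                "autocomplete_search": {    # applied to the user's query
                    "tokenizer": "standard",
                    "filter": ["lowercase", "query_synonyms", "query_stopwords"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "autocomplete_index",
                "search_analyzer": "autocomplete_search",
            },
            "venue_location": {"type": "geo_point"},
        }
    },
}

# Client-side re-rank of the top hits by distance to the user, since we
# boosted on the ES output rather than inside ES (decay constant made up).
import math

def rerank_by_distance(hits, user_lat, user_lon):
    def km(lat1, lon1, lat2, lon2):
        # equirectangular approximation; plenty accurate for ranking
        x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
        y = math.radians(lat2 - lat1)
        return 6371 * math.hypot(x, y)

    def boosted_score(hit):
        d = km(user_lat, user_lon, hit["lat"], hit["lon"])
        return hit["score"] / (1 + d / 100)

    return sorted(hits, key=boosted_score, reverse=True)
```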

In addition to configuring ES, it was pretty difficult to settle on a KPI
because it's not that easy to put searches in the context of the entire user
session... we could see if a given query resulted in: the user clicking on a
result, or no search results, or the user deleting everything in the box and
starting over, but we had a hard time following the user and seeing if the
click led to a purchase.

Also, as a disclaimer, I didn't actually write any code for this project (I'm
a product manager). But I did take a computational linguistics class in
college and worked very closely with the developer :)

~~~
saganus
Awesome answer!

I'm hoping you do find the time to write a blog post on this.

Thanks a lot!

------
iamsomewalrus
Hey! I’m the author of this post. I’m pretty chuffed to see this here. Happy
to answer any questions.

~~~
mooreds
Why didn't you push the data into a traditional data warehouse or SQL
database?

~~~
ecnahc515
For a lot of teams, S3 is a data warehouse, and you can treat it just like
HDFS for the most part with most things in the big-data ecosystem. Presto
works well for letting you query data in these locations without having to
explicitly import it into a traditional data warehouse or a common SQL DB
first.

~~~
wenc
I wonder if anyone here has a good heuristic for identifying the conditions
under which using an S3 + SQL layer as a data warehouse is a better choice
than a SQL database?

I've been exploring the former and it seems to only make sense if the size of
your data is at a scale that is beyond what a single SQL database instance can
handle, and even then, you can continue to scale out with systems like Citus
so the limit isn't a hard one. SQL gives one so much (data mutability,
consistency, indexes, etc.) that I am hesitant to give it up unless the
tradeoffs make sense.

~~~
bcbrown
I've worked with a S3 + SQL system. It was used for serving data for a
reporting dashboard where the stored data was in the 0.1-10 TB range. As the
use case was only semi-interactive (users didn't mind waiting 1-10 seconds for
a report), and all the queries were pre-defined, this solution was a good fit.

I think it makes sense when there's no in-place updates; either querying
write-once data like logs or the output of batch data processing roll-ups that
replace the previous data. The less you need the relational model (like
joins), the better, but some of those needs can be met through careful design
of the storage schema and denormalization.

I wouldn't advocate this sort of solution if your requirements include in-
place updates of existing data, frequent/granular updates of new data,
expressive ad-hoc queries that use the full capability of relational algebra,
or tight latency requirements. You also lose the safety net of referential
integrity and table-level constraints, as those are now enforced in custom
code that can have bugs.

I would say maintaining this system cost about half an engineer's time for
ongoing maintenance and new functionality.

------
polskibus
Wasn't Goodreads bought by Amazon a couple of years ago? If so, they might've
been pushed to make the move (to microservices, S3, etc.) to comply with
corporate guidelines/policy, not because there wasn't a better/more efficient
way to scale.

~~~
iamsomewalrus
There’s no real policy that I’m aware of internally for teams to use
microservices. Amazon has a lot of tooling to make it easy to spin up
services, however.

The first major project after being acquired was to make a pared down
Goodreads experience available on the Kindle Paperwhite. Our first services
came out of that initiative to provide a buffer between the Kindle traffic and
the Goodreads Rails app.

That being said, I'll be the first to caution that small teams should avoid
microservices at first, for fear of creating a distributed monolith.

~~~
toomuchtodo
Are internal Amazon projects cross charged for their AWS use? Or is it
essentially free for internal use?

~~~
iamsomewalrus
Come join my team and I’ll be happy to spill all the beans!

~~~
jtloong
Do you guys have internships on the Goodreads team?

~~~
iamsomewalrus
Absolutely! We have internships in San Francisco and Seattle. Usually they
coincide with the summer, but on occasion we have fall interns.

Reach out to me at feeneyj @ amazon if you'd like and I can forward you to the
right people.

------
Jemaclus
I like the approach, and I've been considering building something similar.
Nice writeup. Is this used only for BI or is it used for real-time queries?
The reason I ask is because goodreads.com is slow AF, and performance is a
concern for me at this point.

~~~
iamsomewalrus
We use carrier pigeons from an S3 data center high in the Himalayas to a
CloudFront distribution center in Atlantic City (don't ask me, ask the
pigeons) to serve all requests.

Just BI.

~~~
CobrastanJorji
It annoys me that AWS only supports RFC 1149. When will RFC 2549 support be
added?

~~~
iamsomewalrus
You, dear person, win my undying support and affection.

------
manojlds
First lesson: it's taken Goodreads this long and this many users to actually
start moving to microservices.

~~~
iamsomewalrus
In general, it's not uncommon for startups of the Goodreads vintage to outgrow
a simple Rails app and use a more service oriented architecture. We've
actually had services for about 5 years now, but we haven't been very vocal
about it.

As I mentioned in another comment, I personally wouldn't advocate for new
teams or startups to use a service oriented architecture right out of the
gate. It's too easy to end up with a distributed monolith (circular
dependencies between services, service uptimes that are tightly coupled).
Engineers also tend to underestimate the build tooling, observability, and
discipline you need to make it seamless.

------
dglass
We did something similar at a previous position, except we set up an AWS
Lambda function that was triggered on every insert or update to a DynamoDB
table. The Lambda function flattened the updated record and inserted it into
our Redshift cluster, which gave us a real-time ETL pipeline for our DynamoDB
data. That allowed us to report on our DynamoDB data just like our relational
data.
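
For anyone curious, the handler looked roughly like this (all table, column,
and connection details below are made up, not our actual code):

```python
# A hypothetical sketch of a DynamoDB Streams -> Redshift Lambda; a real one
# would batch writes and manage credentials properly. Assumes a Redshift-
# compatible Postgres driver (psycopg2) is bundled with the function.
import json
import psycopg2

def flatten(image):
    """Collapse DynamoDB's typed attributes ({'S': 'foo'}, {'N': '1'})
    into plain Python values."""
    out = {}
    for key, typed in image.items():
        (dynamo_type, value), = typed.items()
        out[key] = float(value) if dynamo_type == "N" else value
    return out

def handler(event, context):
    conn = psycopg2.connect(host="redshift.example.internal", port=5439,
                            dbname="analytics", user="etl", password="...")
    with conn, conn.cursor() as cur:
        # Each stream record describes one insert/update on the source table.
        for record in event["Records"]:
            if record["eventName"] in ("INSERT", "MODIFY"):
                row = flatten(record["dynamodb"]["NewImage"])
                cur.execute(
                    "INSERT INTO reviews_flat (id, payload) VALUES (%s, %s)",
                    (row.get("id"), json.dumps(row)),
                )
```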

~~~
iamsomewalrus
We have another GR team that does that and it works well. It's complicated by
the fact that at Amazon every team uses their own AWS account. Concretely, a
service's DynamoDB tables don't live in the same AWS account, let alone the
same VPC, as the Redshift cluster. Obviously, you can figure out the
permissions, etc.

We're trying to get to a place where we have the data in S3 for engineers to
build products off of and for the on-calls to do sanity checks, and the data
in Redshift for our BI needs.

------
mapleoin
What I don't get about Athena is what happens after you've put the data in it.
Fine, you've got SQL and tabular data, but the type of BI I've had to do
usually ends with a graph or some other visual representation rather than a
table. There's only so much data you can import into Excel from a CSV that
Athena produces. Usually I find Periscope/Cluvio to be much better tools for
this, and then you need to go to Redshift. So why bother with the whole data
pipeline to Athena? Does anyone use this alongside Periscope/Cluvio who can
chip in?

~~~
iamsomewalrus
Athena is just a front end for the data that a typical user can understand
(SQL!). The real value is that the data is: in S3, in a more efficient format
(Parquet), and available in the Glue catalog.

The other replies got it right w.r.t. other BI tools. If you’re using Tableau
I think it integrates with Redshift, right? In that case Redshift Spectrum is
an option.

If you don't have any existing BI tools then QuickSight is an option, or
alternatively you can spin up an Elastic MapReduce (EMR) cluster with your
fav open source BI tools.
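
If you just want to poke at it, a query is a single boto3 call (database,
table, and bucket names here are made up for the example):

```python
# Minimal ad-hoc Athena query via boto3; Athena writes the results as CSV
# to the S3 output location you give it.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT rating, COUNT(*) FROM reviews GROUP BY rating",
    QueryExecutionContext={"Database": "goodreads_export"},  # Glue catalog DB
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution with this id
```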

------
the_arun
Isn't it expensive to use S3+Athena instead of DynamoDB?

~~~
iamsomewalrus
This architecture is meant for business intelligence purposes, not for OLTP
queries. You're right, it would be pretty expensive to power a user-facing
service this way.

However, I read a harrowing / awe-inspiring blog post about someone doing just
that. So...¯\\_(ツ)_/¯

------
ryanmarsh
This is pretty much straight from the DynamoDB best practices. Offload
infrequently accessed data (think time series data from previous months) to S3
and use another tool to query it.

[https://docs.aws.amazon.com/amazondynamodb/latest/developerg...](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-time-series.html#bp-time-series-high-volume)

~~~
iamsomewalrus
+1 Everything I've learned the hard way with DynamoDB was in the Best
Practices documentation the whole time.

Now I've learned to read the Best Practices section of any AWS service before
I start implementing. It saves a lot of heartache.

------
mark_l_watson
How does Athena work? The charge is $5 per terabyte scanned, which indicates
(maybe) that no indices are used and queries are processed by a scan through
the data. Is this correct?

~~~
iamsomewalrus
The docs are a good place to start to get a sense of what Athena can do:
[https://docs.aws.amazon.com/athena/latest/ug/what-is.html](https://docs.aws.amazon.com/athena/latest/ug/what-is.html)

There are no indexes in the traditional MySQL / Postgres sense of the word.
You can, however, lay out the data to make your querying more efficient. See:
[https://docs.aws.amazon.com/athena/latest/ug/partitions.html](https://docs.aws.amazon.com/athena/latest/ug/partitions.html)
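
A made-up example of what that layout buys you: if the data lands under keys
like s3://example-bucket/reviews/dt=2018-12-01/..., Athena only scans (and
bills for) the partitions a query names.

```python
# Hypothetical Athena DDL/DML strings illustrating partition pruning;
# the table, bucket, and column names are invented for the example.
CREATE_TABLE = """
CREATE EXTERNAL TABLE reviews (
    book_id string,
    rating  int
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-bucket/reviews/'
"""

# Register each day's data as a partition after it is written:
ADD_PARTITION = """
ALTER TABLE reviews ADD PARTITION (dt = '2018-12-01')
LOCATION 's3://example-bucket/reviews/dt=2018-12-01/'
"""

# This query touches a single partition instead of the whole table:
QUERY = "SELECT count(*) FROM reviews WHERE dt = '2018-12-01'"
```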

------
mariogintili
How can I make my data more complicated? DynamoDB (non-relational,
key-value-esque WTF-store) + something else.

~~~
iamsomewalrus
Carrier pigeon transport?

------
desireco42
It looks like busy work to me. I love Rails but sometimes you get to teams
where people just don't know what to do with themselves, usually good
developers, and they come up with yak shaving and reinventing the wheel for no
good reason.

From what I can gather, Amazon acquired them and now they have to figure out
something to do. There is plenty of UX to fix.

------
hitekker
Why is the author’s HN comment dead?

Joe Feeney's HN account is indeed iamsomewalrus; he wasn't impersonating
anyone. Further, his comment just said:

“Hey! I’m the author of this post. I’m pretty chuffed to see this here. Happy
to answer any questions.”

I want to hear what he has to say.

~~~
js2
Unsure, but I vouched for it so it's undead now.

~~~
hitekker
Thanks. I’m guessing it must have triggered some kind of filter.

------
aws_user_111
With the launch of
[https://aws.amazon.com/about-aws/whats-new/2018/07/aws-glue-...](https://aws.amazon.com/about-aws/whats-new/2018/07/aws-glue-now-supports-reading-from-amazon-dynamodb-tables/),
wouldn't using a Glue ETL job to directly dump the data in Parquet format be
better?
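
I'm imagining something like this, if I'm reading the announcement right
(table and bucket names made up):

```python
# A sketch of a Glue ETL job that reads a DynamoDB table and writes Parquet
# to S3 directly; hypothetical names, not a tested job script.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Read straight from the DynamoDB table...
dyf = glue_ctx.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": "reviews"},
)

# ...and write it out as Parquet, skipping the intermediate export step.
glue_ctx.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/reviews-parquet/"},
    format="parquet",
)
```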

~~~
iamsomewalrus
100%. Expect a follow up in the future with a simplified pipeline.

~~~
aws_user_111
Cool.

