
Show HN: Goodreads Data Pipeline - san089
https://github.com/san089/goodreads_etl_pipeline
======
jonluca
As an aside, Goodreads is the slowest site I use on a regular basis. It's
genuinely shocking how common 5-to-6-second page load times are. I'm not sure
what their stack is, but I'm always blown away that anyone continues to use
it. It feels like a competitor could beat them just by being faster.

~~~
spullara
Probably Ruby on Rails.
[https://stackshare.io/goodreads/goodreads](https://stackshare.io/goodreads/goodreads)

~~~
traverseda
That stackshare site won't let me view content without creating an account.

So that's a downvote from me; please try to find a source that doesn't require
you to register an account.

~~~
FalconSensei
Downvoting someone for contributing to the conversation, without providing any
better source. Great

~~~
traverseda
That's sort of what the downvote button is for: removing things that _don't_
contribute to the conversation. A source that you can't look at without
capitulating to their dark patterns is more or less what that button is _for_.

Also explaining why you're downvoting is a good idea.

~~~
danielg6
Or you could have just found the site yourself and linked to it here like
everyone else does.

OP: paywall link. You: here's the non-paywall link.

Not that hard. And if you think about it, OP typed the answer “Ruby on Rails”.
Would you have downvoted if he didn’t provide a source?

------
adam-_-
This is potentially quite interesting to me. We've been having on-and-off
conversations at work about data reporting, visualization, etc., which is
leading me to pay attention to related topics.

However, it's lacking in any context explaining what you're trying to achieve
and why.

It's probably obvious to some people but for me, it's not, which I think is a
shame.

~~~
mrlatinos
Beyond just data replication/archival purposes, it seems you can use this
to run analysis against Goodreads' entire public dataset. This is much more
efficient than using their API alone.

------
habosa
Someone, somewhere has to be able to make a better alternative to Goodreads
right? The site is slow, ugly, and buggy. The functionality is so simple: I
tell you when I read a book and what I think about it.

I'm just shocked Amazon has been able to own this niche with so little effort.

~~~
drusepth
I've been working on a competitor for a while now, and the hardest part of
replicating functionality is the data. OpenLibrary is probably the best source
for book metadata online, but even their library dumps are riddled with
mistakes that manifest in weird ways as you start building your own library.
The Goodreads site sucks, but they've got surprisingly good data quality that
I don't think anyone else has, and they have a super restrictive data policy,
so you can't repurpose book data, reviews, shelf data, etc., even when users
auth with a Goodreads account.

It's a small moat, but definitely penetrable with more than a little effort.

~~~
johnmaguire2013
I wonder if a partnership with Kobo, or even better Nook (Barnes & Noble)
could help solve both the data problem and the issue with Kindle integration -
while potentially bolstering the e-reader that integrates with it as well.

~~~
Kihashi
I for sure would take that sort of integration into account when looking at a
new ereader, although the primary thing I'd be looking for is library
integration a la OverDrive.

------
bilater
Nice - you can use UNION ALL instead of UNION in your query at the end if
you're confident the datasets don't overlap; UNION deduplicates rows, which
makes the query more expensive. I'm also curious what the
backfilling/recovery process is if something goes wrong and you have to stop
your 10-minute load jobs.
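To illustrate the difference (a minimal sketch with SQLite; the tables and ids are made up, not from the project):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Two hypothetical staging tables with one overlapping row.
cur.execute("CREATE TABLE a (book_id INTEGER)")
cur.execute("CREATE TABLE b (book_id INTEGER)")
cur.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
cur.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION deduplicates (an extra sort/hash pass); UNION ALL just concatenates.
union_rows = cur.execute(
    "SELECT book_id FROM a UNION SELECT book_id FROM b").fetchall()
union_all_rows = cur.execute(
    "SELECT book_id FROM a UNION ALL SELECT book_id FROM b").fetchall()

print(len(union_rows))      # 3 distinct ids
print(len(union_all_rows))  # 4 rows, duplicate kept
```

If the two inputs are guaranteed disjoint, the dedup pass buys nothing, so UNION ALL is the cheaper choice.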

------
prions
Really similar to the pipelines that I engineer/manage at my current company,
although we run our Airflow on Kubernetes.

One optimization though is separating your loading tasks from compute tasks.
This makes the pipeline more resilient and makes backfilling/reprocessing less
of a headache.
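The suggested separation can be sketched in plain Python (function and path names here are hypothetical, not from the project): the load stage only lands raw data keyed by run date, so a backfill re-runs the compute stage alone, without re-extracting.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical landing zone for raw extracts, keyed by run date.
LANDING = Path(tempfile.mkdtemp())

def load(run_date: str, records: list) -> Path:
    """Stage 1: persist raw records exactly as received."""
    path = LANDING / f"raw_{run_date}.json"
    path.write_text(json.dumps(records))
    return path

def compute(run_date: str) -> float:
    """Stage 2: derive metrics from the landed file (idempotent)."""
    records = json.loads((LANDING / f"raw_{run_date}.json").read_text())
    ratings = [r["rating"] for r in records]
    return sum(ratings) / len(ratings)

load("2020-02-01", [{"rating": 4}, {"rating": 5}])
avg = compute("2020-02-01")        # normal run
avg_again = compute("2020-02-01")  # backfill: recompute without reloading
```

Because compute never touches the source system, a bad transform can be fixed and replayed over already-landed files.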

~~~
san089
Thanks for the tip. I actually thought about such a separation, but it was too
late to make the change; the architecture was already laid down by then. But
you make a good point.

------
krmmalik
Super interesting. Surely there are some business cases where someone could
use this data for good(?)

For example, someone could show the disparity between a New York Times
bestseller and the book getting the most activity on Goodreads (added to the
most shelves, for example).

------
gwern
Is this limited strictly to the GoodReads API or does it pull in more
interesting data like the shelf-tags? When I did
[https://www.gwern.net/GoodReads](https://www.gwern.net/GoodReads) the other
month, I had to literally scrape shelves by hand because the API doesn't cover
them and they lie to bots.

------
wefarrell
3 xlarge EMR instances sounds like overkill assuming a volume of around 11 GB
every 10 mins. Using Postgres COPY I've loaded larger files into tables in
seconds, and semi-complex queries will also run in seconds, depending on
indexes. My understanding is that EMR doesn't make sense unless you're
processing terabytes.
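For context on the COPY suggestion, here is a rough sketch of the bulk-load pattern using SQLite as a stand-in (table and column names are made up; in Postgres itself this would be `COPY reviews FROM STDIN WITH (FORMAT csv)`, e.g. via psycopg2's `copy_expert`):

```python
import csv
import io
import sqlite3

# Bulk-load a CSV in one batched call instead of per-row round trips --
# the same idea as Postgres COPY, approximated here with SQLite.
csv_data = "book_id,avg_rating\n1,4.2\n2,3.9\n3,4.7\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (book_id INTEGER, avg_rating REAL)")

reader = csv.reader(io.StringIO(csv_data))
next(reader)  # skip the header row
conn.executemany("INSERT INTO reviews VALUES (?, ?)", reader)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
```

The win in real Postgres COPY is that parsing and writing happen server-side in one streamed pass, which is why single-node loads of this size finish in seconds.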

~~~
TuringNYC
I read through the Readme but didn’t see any volume or velocity figures (I saw
the entity count, but what does this translate into w/r/t bytes?)

Anyone run this who could comment on the metrics and, consequently, server
sizing?

------
sails
Nice overview. I'd suggest to anyone interested in doing something like this
to also consider the much simpler managed approach of using tools like:

* Stitch [ETL/ELT]
* Snowflake [data warehouse]
* dbt [transformations]

I'd recommend taking a look at dbt [1] for a refreshing approach to this
domain. The AWS EMR Redshift approach is great if you _know_ you'll need all
the configurability, but chances are you won't, and even with that said, the
above stack provides it as necessary.

[1] [https://blog.getdbt.com/analytics-engineering-for-everyone/](https://blog.getdbt.com/analytics-engineering-for-everyone/)

------
thrower123
The worst thing about Goodreads is that it is horribly biased by terrible
people. The Historical Fiction category is exceptionally bad. Unfortunately
it is heavily integrated with Amazon.

~~~
slightwinder
Biased toward what?

~~~
FalconSensei
I've seen many cases of massive 1-star ratings on books that were not even
published yet, because people didn't like the author as a person, or because
the book dealt with a sensitive subject.

I've also seen the opposite: 5-star ratings on an unreleased book because the
author (as a person) is liked by the community.

------
kgraves
Interesting, would like some detail on the cost of this ETL setup on AWS,
unfortunately I can't see anything on this from the project page.

~~~
jmedefind
Since they are owned by Amazon, I would think their cost is close to nothing.

~~~
dswalter
That's a bit of a misconception. From what I understand, Amazon's non-AWS
branches don't get deeply-discounted services from AWS. There is a discount,
but it's not enough to turn dark skies into sunshine and rainbows.

Amazon tends to want every part of itself to be in ship-shape, and giving
itself a massive discount would discourage efficiency in non-AWS parts of the
business.

Disclosure: neither a current nor former Amazon employee.

~~~
augmachina
This is a misconception. Amazon wants to depict itself as wanting every part
to be in ship-shape, but it does not operate that way, and AWS is treated
like any other internal resource, such as printers and staplers.

------
skandl
Amazon literally siphons off the data and has invested so little in its users.

Any recommendations for an alternative?

~~~
FalconSensei
LibraryThing if you don't mind the ugly website

------
ldng
For my own education, what is a "Data Lake"? Is "data warehouse" considered
has-been, and this is the new hype term for it?

~~~
bdibs
My understanding is that a data lake holds raw, unstructured data, whereas in
a warehouse everything is already parsed, processed, and queryable.
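A toy contrast along those lines (a minimal sketch; all names and data are made up):

```python
import json
import sqlite3

# "Lake": raw payloads kept exactly as received, schema-on-read.
lake = [
    '{"book": "Dune", "rating": 5, "extra": {"shelf": "sci-fi"}}',
    '{"book": "Emma", "rating": 4}',
]

# "Warehouse": the parsed, queryable form with a fixed schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE books (title TEXT, rating INTEGER)")

# The ETL step: parse raw blobs into the schema, dropping what doesn't fit.
for blob in lake:
    doc = json.loads(blob)
    warehouse.execute("INSERT INTO books VALUES (?, ?)",
                      (doc["book"], doc["rating"]))

top = warehouse.execute(
    "SELECT title FROM books ORDER BY rating DESC").fetchone()[0]
```

The lake keeps fields the warehouse schema never anticipated (like the nested `extra`), which is why the raw copy is worth retaining.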

~~~
ldng
Thanks. I see that there is a Wikipedia page now. I only started hearing
about it 1 or 2 years ago and did not find much on it at the time.

------
swyx
I'm not a data science guy, so I need an ELI5 on this: is it scraping all of
Goodreads and passing it into a data pipeline? It seems like a third-party
project. Is this just to demonstrate data/ETL skills, or what are some
practical uses of this? Sorry, it's not obvious to me.

