Show HN: Goodreads Data Pipeline (github.com/san089)
213 points by san089 on Feb 27, 2020 | 65 comments

As an aside, Goodreads is the slowest site I use on a regular basis. It's genuinely shocking that there are 5 to 6 second page load times. I'm not sure what their stack is, but I'm always blown away that anyone continues to use this. It feels like a competitor could beat them by just being faster.

That stackshare site won't let me view content without creating an account.

So that's a downvote from me, please try to find a source that doesn't require you register an account.

Downvoting someone for contributing to the conversation, without providing any better source. Great

That's sort of what the downvote button is for, removing things that don't contribute to the conversation. A source that you can't look at without capitulating to their dark patterns is more or less what that button is for.

Also explaining why you're downvoting is a good idea.

Or you could have just found the site yourself and linked to it here like everyone else does.

OP: Paywall link.
You: Here’s the non-paywall link.

Not that hard. And if you think about it, OP typed the answer “Ruby on Rails”. Would you have downvoted if he didn’t provide a source?

I just accessed the stackshare URL without signing in.

I was able to access that link but couldn't do anything else without having to sign in. I also can no longer go back to that link without signing in.

I had to open it in a private browser window

Couldn’t agree more. It must have one of the worst usage-to-enjoyment ratios of any site.

This is potentially quite interesting to me, as we are having on/off conversations at work about data reporting, visualising, etc., which is leading me to pay attention to related topics.

However, it's lacking in any context explaining what you're trying to achieve and why.

It's probably obvious to some people but for me, it's not, which I think is a shame.

Beyond just data replication/archival purposes, it seems you can use this to run analysis against Goodreads' entire public dataset. This is much more efficient than using their API alone.

The architecture also seems pretty complex. I am wondering at what level of requirements or data complexity people should consider something like this, as opposed to running a little cronjob on a $5 server somewhere.

not dissing the author, genuinely trying to understand the spectrum of data science needs

Someone, somewhere has to be able to make a better alternative to Goodreads right? The site is slow, ugly, and buggy. The functionality is so simple: I tell you when I read a book and what I think about it.

I'm just shocked Amazon has been able to own this niche with so little effort.

I've been working on a competitor for a while now, and the hardest part of replicating functionality is the data. OpenLibrary is probably the best source for book metadata online, but even their library dumps are riddled with mistakes that manifest in weird ways as you start building your own library. The Goodreads site sucks, but they've got surprising data quality that I don't think anyone else has; and they have a super restrictive data policy so you can't repurpose book data, reviews, shelf data, etc, even when users auth with a Goodreads account.

It's a small moat, but definitely penetrable with more than a little effort.

The good data quality is actually an artifact of humans being involved in every step of the cataloging process. There's a large group on Goodreads called the Goodreads Librarians, and that group has around a hundred thousand dedicated people who go through and flag anomalies, correct titles and indexes, etc.

Book publishers or people who've worked in book publishing will know that the book database is one area you don't want to mess with unless you know what you are doing. ISBNs are not the be-all and end-all of the story, and when you start taking into account special editions, covers, ebook editions, and language translations, you'll start to realize that the book catalog system going back in history, including the Dewey Decimal System, is a marvel of human achievement.

Of course establishing a good quality index is going to take work. People often forget that quality takes human work and effort.

EDIT: I lied. I changed the number from my original estimate of a "few hundred" to "hundred thousand". The Goodreads Librarians group has 103718 members as of just now, so it's actually a large number of humans submitting fixes to their catalog.


If you take a look at the kind of discussions taking place, those are the kinds of things any competitor to Goodreads needs to know about.

I wonder if a partnership with Kobo, or even better Nook (Barnes & Noble) could help solve both the data problem and the issue with Kindle integration - while potentially bolstering the e-reader that integrates with it as well.

I for sure would take that sort of integration into account when looking at a new ereader. Although the primary thing I'd be looking for is library integration a la OverDrive.

Facts are not copyrightable and scraping has been determined to be legal. IANAL, but I'm not sure law would protect them from the factual metadata about books being repurposed.

Not being illegal wouldn't protect you from a crushing lawsuit though. Especially since the details likely vary (LinkedIn data was publicly accessible, not sure if Goodreads requires a login).

Demanding perfect data is a waste of time that will let you procrastinate your product indefinitely.

If you're aiming for "like X, but for people who are actually interested in the market X supposedly serves" then maybe this isn't true.

You could nick the scraping code from Calibre...

The scraping part is probably not the complicated portion of the endeavor.

As a regular Goodreads user, I've never cared about the site's relatively slow load times. What's important to me is the trust that the site will still be around in 20 years, largely thanks to the Amazon ownership and Kindle integrations. I wouldn't have that same faith in an anonymous competitor.

Goodreads was that anonymous site once upon a time. You're just not an early adopter and that's okay. That's no reason to not create a better alternative.

I think it's a little insincere to compare Goodreads' release 13 years ago with a competitor launching against it now.

The Kindle integration, the amount of correct data, and the fact that it's not going to vanish in the next year are what keep me using Goodreads.

Not to imply this functionality is complex, but really the most important thing for me is the lists:

I _love_ that I can take a book I enjoyed, see it's on a list of "Best Magic Systems", and note what was rated even better for its magic system

A simple method of discovery for me



I find the site decently fast, definitely ugly but then again I don't want it to get a reddit-style redesign either. The information density is ok right now, and I'm actually impressed by the wide range of functionalities they have, related to reviewing books and updating your progress.

The thing that Goodreads has that will be hard to replicate is the Kindle integration.

This is the Achilles heel of any potential competitor. The lazy integration means there is a big subset of users who simply won't engage with a competitor because it requires more work. Couple that with the social graph Goodreads already has and you're looking at a huge moat.

Have you tried looking? There's LibraryThing and a couple others.

I don't think there's much value left on the table in the niche, though. Kindles have first-class Goodreads sync and even a Goodreads button in their global navbar. And Goodreads' competitors, for the few people who don't want to use Goodreads, already have a deep rut of incumbency.

Even you, who supposedly has great issues with Goodreads, apparently weren't bothered enough to even see if competitors existed all this time, much less before writing your comment. Doesn't bode well for the Goodreads competitor market, lol.

None of the competitors I'm aware of have fuzzy search though, which is pretty annoying.

"color prple"

LibraryThing: 0 results

Goodreads: 2,000+ results and they're well sorted
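To make the "color prple" comparison concrete, here is a minimal sketch of exact substring search versus fuzzy matching, using Python's standard-library difflib. The tiny title list and both function names are made up for illustration; this says nothing about how Goodreads or LibraryThing actually implement search.

```python
import difflib

# Hypothetical mini-catalog, purely for illustration.
TITLES = ["The Color Purple", "The Color of Magic", "Purple Hibiscus", "Beloved"]


def exact_search(query, titles):
    """Case-insensitive substring search: a single typo yields no hits."""
    q = query.lower()
    return [t for t in titles if q in t.lower()]


def fuzzy_search(query, titles, cutoff=0.6):
    """Rank titles by difflib similarity and keep those above the cutoff."""
    by_lower = {t.lower(): t for t in titles}
    matches = difflib.get_close_matches(
        query.lower(), list(by_lower), n=5, cutoff=cutoff
    )
    return [by_lower[m] for m in matches]


print(exact_search("color prple", TITLES))  # no substring match for the typo
print(fuzzy_search("color prple", TITLES))  # "The Color Purple" ranks first
```

A real catalog would more likely lean on an index built for this (trigram or edit-distance based) than on comparing the query against every title, but the user-visible difference is the same.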

Reddit has terrible search too, but you can appreciate that "Reddit but with good search" isn't all it takes to compete with Reddit. That's 0.001% of the work.

And of course Goodreads has issues of its own, but none of them are show-stoppers for most people, least of all the people who just use it as a glorified Excel spreadsheet.

I only chuckle about this because, like many enterprising HNers, I myself have considered building a Goodreads competitor in the past and even managed to build the ol' weekend prototype (i.e. 0.001% of the work). It's one of those projects where you start and, after you get some of the easy things done like fuzzy search, you go "wait, wtf am I doing? Who would switch to this?"

Using and improving OpenLibrary is also alluring, but pretty hard to do without an application with actual users that have some sort of "edit book" functionality that you can then moderate and submit upstream to the OpenLibrary data source.

For example, look how ListenNotes.com lets users edit its podcast database: https://www.listennotes.com/podcasts/the-joe-rogan-experienc... -> the "Edit" tab.

I think most people use Reddit to just browse the subreddits. Goodreads is about searching for books and adding them to your shelves; many of those might be books that someone just mentioned to you in passing, or whose full/correct name you don't remember.

Different usage than reddit

Goodreads already doesn't look awesome, but whoa, LibraryThing looks like it hasn't been updated in the past decade.

Just signed up for FediReads[1] last week. It's a decentralized Goodreads with ActivityPub, and open source[2].

[1] http://fedireads-test.glitch.me/

[2] https://github.com/mouse-reeve/fedireads/

> This is just a demo, any data here may be deleted without warning. sign up for email updates

So basically, keep using GoodReads for now?

I've been working on something like this. Super simple, like a spreadsheet of what you read but as a SaaS. I was thinking of monetizing it à la Pinboard: focused on privacy. Like, $3 per month and you have it, without Amazon or Google knowing what books you read and how you rate them.

Nice - You can use UNION ALL instead of UNION in your query at the end if you're confident the datasets don't overlap. Query is less expensive. I'm also curious what the backfilling/recovery process is if something goes wrong and you have to stop your 10 min load jobs.
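A small sketch of the UNION vs UNION ALL point, using an in-memory SQLite database. The table and column names are invented for the example and are not taken from the project's queries. UNION has to deduplicate the combined rows, while UNION ALL simply concatenates them, so the two only return the same result when the inputs don't overlap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables; one row (2, 3) deliberately appears in both.
conn.executescript(
    """
    CREATE TABLE reviews_a (review_id INTEGER, rating INTEGER);
    CREATE TABLE reviews_b (review_id INTEGER, rating INTEGER);
    INSERT INTO reviews_a VALUES (1, 5), (2, 3);
    INSERT INTO reviews_b VALUES (3, 4), (2, 3);
    """
)

# UNION removes duplicate rows, which costs an extra sort or hash step.
union_rows = conn.execute(
    "SELECT review_id, rating FROM reviews_a "
    "UNION SELECT review_id, rating FROM reviews_b"
).fetchall()

# UNION ALL keeps every row and skips the deduplication work.
union_all_rows = conn.execute(
    "SELECT review_id, rating FROM reviews_a "
    "UNION ALL SELECT review_id, rating FROM reviews_b"
).fetchall()

print(len(union_rows), len(union_all_rows))
```

The overlapping row is there on purpose: it shows why the swap is only safe when you are confident the datasets are disjoint.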

Really similar to the pipelines that I engineer/manage at my current company, although we have our Airflow on Kubernetes.

One optimization though is separating your loading tasks from compute tasks. This makes the pipeline more resilient and makes backfilling/reprocessing less of a headache.
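One way to picture the load/compute split is the plain-Python sketch below. It is not the project's Airflow code; the staging dict, function names, and fake source are all hypothetical stand-ins. The load step only copies raw records into staging, and the compute step only reads from staging, so a reprocess never has to hit the source again.

```python
from typing import Callable, Dict, List

staging: Dict[str, List[dict]] = {}   # stand-in for a raw landing zone
warehouse: Dict[str, float] = {}      # stand-in for the processed store
fetch_count = 0


def fake_fetch(batch_id: str) -> List[dict]:
    """Hypothetical source call; counts how often the source is touched."""
    global fetch_count
    fetch_count += 1
    return [{"book_id": 1, "rating": 4}, {"book_id": 2, "rating": 2}]


def load_batch(batch_id: str, fetch: Callable[[str], List[dict]]) -> None:
    """Load step: land raw records untouched; skip batches already landed."""
    if batch_id not in staging:
        staging[batch_id] = fetch(batch_id)


def compute_batch(batch_id: str) -> None:
    """Compute step: derive results from staged data only."""
    rows = staging[batch_id]
    warehouse[batch_id] = sum(r["rating"] for r in rows) / len(rows)


batch = "2020-02-27T10:00"
load_batch(batch, fake_fetch)
compute_batch(batch)
compute_batch(batch)  # reprocessing reruns compute without refetching
print(fetch_count, warehouse[batch])
```

In an orchestrator this would map to two separate tasks per batch, so a failed transform can be retried or backfilled on its own.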

Thanks for the tip. I actually thought about such a separation, but it was too late to make such changes; I had already laid down the architecture by then. But you made a good point.

Super interesting. Surely there are some business cases of how someone could use this data for good (?)

For example, someone could show the disparity between a New York Times bestseller and the book getting the most activity on Goodreads (added to the most shelves, for example).

Is this limited strictly to the GoodReads API or does it pull in more interesting data like the shelf-tags? When I did https://www.gwern.net/GoodReads the other month, I had to literally scrape shelves by hand because the API doesn't cover them and they lie to bots.

3 xlarge EMR instances sounds like overkill assuming a volume of around 11 GB every 10 mins. Using Postgres COPY I've loaded larger files into tables in seconds. Semi-complex queries will also take seconds depending on indexes. My understanding is that EMR doesn't make sense unless you're processing terabytes.
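For a rough sense of scale, the back-of-the-envelope arithmetic below takes the 11 GB per 10 minutes figure at face value. That figure is the commenter's assumption, not a measured number from the project, and decimal units (1 GB = 1000 MB) are assumed.

```python
# Assumed inputs, taken from the comment above rather than measured.
GB_PER_BATCH = 11
BATCH_INTERVAL_S = 10 * 60

# Sustained ingest rate in MB/s (decimal units assumed).
mb_per_second = GB_PER_BATCH * 1000 / BATCH_INTERVAL_S

# Number of 10-minute batches in a day, and the resulting daily volume.
batches_per_day = 24 * 60 * 60 // BATCH_INTERVAL_S
gb_per_day = GB_PER_BATCH * batches_per_day

print(f"{mb_per_second:.1f} MB/s, {batches_per_day} batches/day, {gb_per_day} GB/day")
```

So each individual batch is small, but if that rate really were sustained around the clock the daily total would be on the order of a terabyte and a half, which is worth keeping in mind when judging the sizing.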

I read through the Readme but didn’t see any volume or velocity figures (I saw the entity count, but what does this translate into w/r/t bytes?)

Anyone run this who could comment on the metrics and, consequently, server sizing?

Nice overview. I'd suggest to anyone interested in doing something like this to also consider the much simpler managed approach of using tools like:
* Stitch [etl/elt]
* Snowflake [data warehouse]
* dbt [transformations]

I'd recommend taking a look at dbt [1] for a refreshing approach to this domain. The AWS EMR Redshift approach is great if you _know_ you'll need all the configurability, but chances are you won't, and even with that said, the above stack provides it as necessary.

[1] https://blog.getdbt.com/analytics-engineering-for-everyone/

The worst thing about Goodreads is that it is horribly biased by terrible people. The Historical Fiction category is exceptionally terrible. Unfortunately it is integrated heavily by Amazon.

Biased toward what?

I've seen many cases of massive 1 star ratings of books that were not even published yet, because people didn't like the author as a person, or because it dealt with a sensitive subject.

I've also seen the opposite, with 5 star ratings on an unreleased book because the author (as a person) is liked by the community.

The historical fiction category, in particular, is wildly female-biased towards historical-romance.

Interesting, would like some detail on the cost of this ETL setup on AWS, unfortunately I can't see anything on this from the project page.

Since they are owned by Amazon, I would think their cost is close to nothing.

That's a bit of a misconception. From what I understand, Amazon's non-AWS branches don't get deeply-discounted services from AWS. There is a discount, but it's not enough to turn dark skies into sunshine and rainbows.

Amazon tends to want every part of itself to be in ship-shape, and giving itself a massive discount would discourage efficiency in non-AWS parts of the business.

Disclosure: neither a current nor former Amazon employee.

This is a misconception. Amazon wants to depict itself as wanting every part to be in ship-shape, but it does not operate that way and AWS is treated like any internal resource like printers and staplers.

AWS basically finances the rest of Amazon. It's 70% of its revenue (that is public info). Except for retail, the rest is all losses. So the discounts don't matter much, other branches just try to save money (frugality is one of Amazon's core values) but basically get what they need.

This repository is not associated with Amazon.

No, but Goodreads (the child subject under discussion) is.

Amazon literally siphons off the data and has invested so little in its users.

Any recommendations for an alternative?

LibraryThing if you don't mind the ugly website

For my own education, what is a "data lake"? Is "data warehouse" a has-been, and this is the new hyped name for it?

My understanding is that a data lake holds raw, unstructured data, whereas a warehouse holds data that has already been parsed and processed and is ready to query.
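A toy illustration of that distinction, with made-up records and an in-memory SQLite table standing in for the warehouse. It is a sketch of the general idea only, not of the project's S3/Redshift setup: the lake keeps whatever arrives, and the schema is enforced when data is written into the warehouse.

```python
import json
import sqlite3

# Hypothetical raw events as they might arrive: inconsistent and partly broken.
raw_events = [
    '{"book_id": 1, "rating": 5, "shelf": "fantasy"}',
    '{"book_id": 2, "rating": "4"}',
    "not json at all",
]

# "Lake": store everything as-is; interpretation is deferred to read time.
lake = list(raw_events)

# "Warehouse": parse and validate on write into a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (book_id INTEGER NOT NULL, rating INTEGER NOT NULL)"
)
rejected = []
for line in lake:
    try:
        rec = json.loads(line)
        conn.execute(
            "INSERT INTO ratings VALUES (?, ?)",
            (int(rec["book_id"]), int(rec["rating"])),
        )
    except (ValueError, KeyError, TypeError):
        rejected.append(line)  # stays in the lake, never reaches the warehouse

avg_rating = conn.execute("SELECT AVG(rating) FROM ratings").fetchone()[0]
print(len(lake), len(rejected), avg_rating)
```

The malformed line is still available in the lake for later inspection, while queries against the warehouse table only ever see clean, typed rows.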

Thanks. I see that there is a Wikipedia page now. I only started to hear about it 1 or 2 years ago and did not find too much on it at the time.

I'm not a data science guy so I need an ELI5 on this. Is this scraping all of Goodreads and passing it into a data pipeline? It seems like a 3rd party project. Is this just to demonstrate data/ETL skills? What are some practical uses of this? Sorry, it's not obvious to me.
