
Show HN: BestSFBooks, Mashup of the Best SF/Fantasy Books - gurgeous
I'm releasing BestSFBooks today:

http://www.bestsfbooks.com

I was inspired by this HN post from a few weeks ago:

http://news.ycombinator.com/item?id=2978027

BestSFBooks ranks science fiction/fantasy books according to how many awards they've won or been nominated for. I included all the big ones (Hugo, Nebula, etc.) and also more obscure awards that I like such as SF Site Editor's choice. Once I had all the awards in there it was easy to start creating a New Book list based on award winning authors.

I could tell the app was working as soon as I saw the two book lists on the home page. They're excellent!

The stack is virtually identical to the stuff I used to build PickHealthInsurance. BestSFBooks is built with Rails 3.1.1 (HAML, Sass, CoffeeScript). It's hosted on Heroku and MongoHQ. I used the Twitter Bootstrap CSS toolkit, which I continue to find hugely innovative and useful. As always, data acquisition and cleanup was the hardest part.

I've wanted to build something like this for quite a while. As my time becomes more valuable, I'm becoming less tolerant of bad books. For a laugh, check out the "prototype" that I created ten years ago - http://www.gurge.com/amd/top100

Please send feedback!
======
bgraves
_"As always, data acquisition and cleanup was the hardest part."_

I'm most interested in this piece of the project. What were your particular
tools and methodologies? How long did it take you, once you identified your
data sources? Any interesting stumbling blocks or problems that were solved
along the way?

~~~
gurgeous
I've written a lot of data tools as part of Urbanspoon and subsequent
startups. I like to collect publicly available data, clean it up, normalize
it, and then release it in a more useful way.

Hot tips for crawling data:

    
    
      - Cache pages locally while you work on the indexing
      - Nokogiri is awesome
      - Don't be afraid to use regular expressions
      - Initially, put data into a spreadsheet (not the db).
        That way it can be checked in and diffed.
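
A minimal sketch of the first tip (caching pages locally), using only the Ruby standard library. The helper name `cached_fetch`, the `cache` directory, and the injectable `fetcher` parameter are assumptions made for illustration; they are not taken from the BestSFBooks code.

```ruby
require "digest/sha1"
require "fileutils"
require "net/http"
require "uri"

# Hypothetical helper: return the page body for `url`. A local copy is used
# when one exists; otherwise the page is fetched once and saved to disk.
# `fetcher` is injectable so the download step can be swapped or stubbed.
def cached_fetch(url, cache_dir: "cache", fetcher: ->(u) { Net::HTTP.get(URI(u)) })
  FileUtils.mkdir_p(cache_dir)
  # Hash the URL so that any URL maps to a safe, fixed-length filename.
  path = File.join(cache_dir, Digest::SHA1.hexdigest(url) + ".html")
  return File.read(path) if File.exist?(path)

  body = fetcher.call(url)
  File.write(path, body)
  body
end
```

With something like this in place, the parsing code can be rerun as often as needed without hitting the source site again; the cached HTML can then be handed to a parser such as Nokogiri.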
    

I also have a lot of subtle tricks for cleaning up messy data. For example, to
see if two similar authors refer to the same person, I have a method that
converts an author name to an author key. The key is just like the name, only
it's been uppercased, apostrophes removed, etc. Plus weird stuff like this:

    
    
      # replace each run of vowels (Y included) with a single E
      s = s.gsub(/[AEIOUY]+/, "E")
    

It's little things like this hack that make a big difference in data quality.
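
Put together, such a key method might look roughly like the sketch below. Only the uppercasing, the apostrophe removal, and the vowel step come from the description above; the method name `author_key`, the punctuation handling, and the whitespace squeeze are assumptions.

```ruby
# Hypothetical normalizer: spelling variants of one author should map to the
# same key. This is a sketch, not the actual BestSFBooks method.
def author_key(name)
  s = name.upcase
  s = s.delete("'")               # remove apostrophes: O'BRIEN -> OBRIEN
  s = s.gsub(/[^A-Z ]+/, " ")     # assumption: other punctuation becomes a space
  s = s.gsub(/[AEIOUY]+/, "E")    # collapse each run of vowels into a single E
  s.split.join(" ")               # assumption: squeeze repeated whitespace
end

author_key("Neal Stephenson") == author_key("Neil Stephenson")    # => true
author_key("Ursula K. Le Guin") == author_key("Ursula K Le Guin") # => true
```

The trade-off is that a key this aggressive can also merge distinct names, so matches are best treated as candidates to review rather than as proof.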

(edit: formatting)

~~~
bgraves
Thanks! I've been doing some scraping projects lately and really like it a
lot. There's a pretty steep learning curve, but it gets easier and easier as
you go along, I think.

1\. Caching pages is definitely a great idea while debugging. Especially if
the data source has a request limit :)

2\. I've never heard of Nokogiri, but it looks like BeautifulSoup for Ruby.
I've found that Python has worked for everything I need so far, but thanks for
the reference.

3\. I suck so bad at regex, but using it more will help me climb that
mountain.

4\. One tip I've used is writing out the "INSERT INTO TABLE..." statements
along with the scraped results. I definitely use CSV (and Google Refine) for
general clean up and spot checking.

5\. You should write a 'Data Scraping One-liners Explained' ebook :)

~~~
semanticist
I spent the first half of this year writing scrapers for every newspaper in
the UK. My Top Regex Tip is <http://rubular.com/> \- this thing saved me HOURS
of my life.

------
natbro
OK, I have to admit at first I thought there was something wrong with the site
because I didn't recognize enough books... but after reading some excerpts of
books off the lists, I'm psyched -- turns out I just haven't been finding the
good books for a long time, so now I can.

Suggestions:

* show me excerpts on-page (if possible from amzn?)

* allow community +1/-1 on books and generate lists based on top-rated by site users

* commenting, Facebook or Disqus, on each book

(little issue: Facebook "like" button on main page and book pages doesn't seem
to be working -- dunno if that's facebook's problem not yours)

~~~
gurgeous
Thanks Nat - I fixed the like button. Turns out that you have to specify the
href param when using the iframe, unlike twitter.

------
phrotoma
This fairly reeks of awesome. If the ones I don't recognize are as great as
the ones I do, I'd say you have a damn fine site there!

------
sammyo
A hook into Google book library reference 'available at your local library'
would be awesome.

------
wtf242
looks awesome! I wrote a similar site that aggregates the general fiction and
non-fiction based on how many awards and lists they are on.
<http://thegreatestbooks.org>

I built it on Rails as well

------
gurgeous
Update - GeekWire picked it up:

[http://www.geekwire.com/2011/hunger-science-fiction-books-
sp...](http://www.geekwire.com/2011/hunger-science-fiction-books-sparks)

Also, thanks to webwright for the title suggestion.

------
100k
Cool! Very nice.

I'm sort of slowly working my way through the double winners of the Hugo and
Nebula (kind of a lifetime goal, I guess).

It took me a while to find the "Hall of Fame" for books, but that is what I
immediately wanted from a site like this (I don't care so much about the
year-by-year rankings). Maybe make it more prominent?

EDIT: also, I think there's a big difference between nominations and winning.
Would be cool to sort based on actually won awards, not just nominations.
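
For what it's worth, a wins-first ordering could be sketched like this. The `Book` struct, its field names, and `rank_by_wins` are all hypothetical (I have no idea what the site's schema looks like), and the titles are placeholders.

```ruby
# Hypothetical record; the field names are illustrative only.
Book = Struct.new(:title, :wins, :nominations)

# Sort by wins (descending), then nominations (descending) as a tie-breaker,
# then title so the order is stable and predictable.
def rank_by_wins(books)
  books.sort_by { |b| [-b.wins, -b.nominations, b.title] }
end

books = [
  Book.new("Book A", 0, 5),
  Book.new("Book B", 2, 2),
  Book.new("Book C", 2, 4),
]
rank_by_wins(books).map(&:title) # => ["Book C", "Book B", "Book A"]
```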

------
javanix
This is awesome - I've been searching for somewhere to find _good_ new
SF/Fantasy for a while that doesn't just entail blindly searching Amazon.

------
Urgo
I like it. One suggestion (or request, really): can you add a link to
Audible in addition to Amazon and the Kindle for us audiobook fans?

------
adamzochowski
Awesome job. It is similar to existing <https://www.worldswithoutend.com/>

------
2mur
This is really, really fantastic. Been a huge sf/f reader my whole life
(collect 1sts of lots of favorite authors... hovering around 6000 now). I
basically used to do this manually: end of the year check Locus, check SF Site
lists, check Hugos... read them all.

------
md1515
This is fantastic. I think you should follow natbro's lead on a few things -
allow Disqus commenting and show excerpts of the book.

Also I have a series to add: Harry Turtledove's "Darkness" series. It is like
5 books, if I remember correctly. I'm at work...

------
pasbesoin
Yay!

<http://www.bestsfbooks.com/b/3119/The-Children-of-the-Sky>

The Children of the Sky Series: Zones of Thought #3 by Vernor Vinge (Tor,
Oct-2011)

------
dougws
There seems to be a really heavy bias towards relatively new SF; is this
because there are more awards now than there used to be, or because you only
have data going so far back?

------
gigawatt
This is fantastic! Perfectly simple and useful.

Only glitch I've run into so far is that the search function doesn't work. As
far as I can tell, there's no type="submit" for the form.

~~~
jogrimst
Hello. That is because it is an auto-complete search field. If you search for
something that does not exist, nothing will happen. If you type, for example,
"Krak", you will get an auto-complete suggestion for "Kraken".

Maybe it would be an idea to show a message like "No results..." if what the
user types does not return any matches in this search function?

------
wccrawford
You're going to cause me to spend far too much money. Dang you.

------
peapicker
Well done, you! I like the site a lot, bookmarked.

