
Ask HN: Extracting book titles from comments - _6cj7
Using named-entity recognition, how can I extract book titles from HN comments? Should I train a NER chunker on HN data?
======
r3bl
There is one project[0] that accomplishes the same thing, but instead of
searching for the titles, it searches for the links to online stores (like
Amazon) and then checks if the links are indeed books or not. Might be
relevant to you.

[0] [http://hackernewsbooks.com/](http://hackernewsbooks.com/)

~~~
jackschultz
Ha, that's hilarious, because I created one that searches all Reddit comments
for products from Amazon[0]. I'd always been thinking of moving toward
searching for all proper nouns, but obviously that's more difficult on this
front. So if anyone has suggestions on whether to expand, and in what
direction, let me know!

[0] [http://www.productmentions.com/](http://www.productmentions.com/)

~~~
shanecleveland
I believe this approach captures a large enough sample to give a good
overview of "most commonly linked-to items." If 100 people link to the same
product across many domains, chances are that most of those will be Amazon
links. You don't catch all of the links, but you get most of them.

The exception will be items from sellers whose products aren't available on
Amazon. And some communities on Reddit may be more loyal to another
e-commerce platform or site you would miss (crafters to Etsy, antique
collectors to eBay, etc.).

------
_lpa_
I did this a while ago (www.hnreads.com, bookbot.io). Manually labelled ~2k
comments, trained a NER system, then validated the titles via Amazon's API. It
was pretty easy once I had the manually labelled comments. I don't recall my
F1/precision/recall scores - they were OK, but lower than the state of the art
reported in papers.

~~~
jmstfv
That's interesting. Which labels were you assigning to them?

~~~
_lpa_
I had a macro in Emacs that wrapped the highlighted text in an XML tag (say
<book></book>). Processing that, I could convert it to whatever labelling
scheme you fancy - e.g. IOB. The labelling didn't really take that long to do,
maybe a few hours over a couple of days.
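A minimal sketch of that post-processing step, assuming whitespace
tokenization, non-nested `<book>` tags, and a BOOK label (the tag and label
names are illustrative):

```python
import re

def tag_to_iob(text):
    """Convert inline <book>...</book> annotations to (token, IOB-label)
    pairs. Assumes whitespace tokenization and non-nested tags."""
    pairs = []
    # Split the text into tagged and untagged segments.
    for segment in re.split(r'(<book>.*?</book>)', text):
        m = re.match(r'<book>(.*?)</book>', segment)
        if m:
            tokens = m.group(1).split()
            for i, tok in enumerate(tokens):
                pairs.append((tok, 'B-BOOK' if i == 0 else 'I-BOOK'))
        else:
            pairs.extend((tok, 'O') for tok in segment.split())
    return pairs

# tag_to_iob("I loved <book>Zero to One</book> by Thiel")
# → [('I','O'), ('loved','O'), ('Zero','B-BOOK'), ('to','I-BOOK'),
#    ('One','I-BOOK'), ('by','O'), ('Thiel','O')]
```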

------
ar7hur
I would do something like this:

1\. Get a few thousand book titles that are "long enough" that the
probability they appear in a sentence without referring to the book is low

2\. Search these spans in HN comments and use this corpus to train Stanford's
CoreNLP NER

3\. Run the NER on all comments

4\. Check on Openlibrary or another book DB that the extracted spans are real
book titles
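Steps 1 and 2 above (distant labelling by matching known titles) could be
sketched roughly like this; the title list and comments are placeholders, and
the (text, spans) output format mirrors what span-based NER trainers such as
spaCy typically accept:

```python
def distant_label(comments, known_titles):
    """Auto-label comments by exact-matching known book titles,
    producing (text, [(start, end, 'BOOK')]) training examples."""
    examples = []
    for text in comments:
        spans = []
        low = text.lower()
        for title in known_titles:
            start = low.find(title.lower())
            if start != -1:
                # Record the character span of the matched title.
                spans.append((start, start + len(title), 'BOOK'))
        if spans:
            examples.append((text, spans))
    return examples
```

The exact-match assumption is what makes the "long enough titles" filter in
step 1 important: short titles would produce too many spurious matches.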

~~~
mfalcon
This is known as bootstrapping or semi-supervised learning, just in case you
want to look up some of the theory behind it.

~~~
achompas
How so? I don't see reference to sampling with replacement or a suggestion to
re-use the unsupervised results to further improve the model. Seems like I'm
missing something...

~~~
mfalcon
You're creating a small labelled dataset without manually annotating it, in
order to train a supervised learning model.

I'm not an expert, but I think a bootstrapping technique doesn't imply
continual improvement of the model.

~~~
achompas
Ahh, I see. I was confused about whether semi-supervised approaches rely on
using predictions on the unlabeled data to improve model performance.
Wikipedia seems to suggest this is a key component which isn't mentioned in
OP:

> Semi-supervised learning may refer to either transductive learning or
> inductive learning. The goal of transductive learning is to infer the
> correct labels for the given unlabeled data.

Agreed on bootstrap, but in the proposed approach you're not artificially
expanding your sample size by sampling with replacement.

------
kuboris
I built something similar for Reddit using Amazon links.
[http://booksreddit.com](http://booksreddit.com)

I'm in the process of adding Goodreads and other book websites to get better
suggestions. (Using the Amazon API is limiting at scale.)

As for your question: I've been researching Python natural language processing
to extract probable named entities and check them against the Amazon API. I
suggest [https://spacy.io](https://spacy.io), which has reasonable named-entity
extraction. However, doing it at scale might produce a lot of false positives:
books whose titles are common phrases.

~~~
jmstfv
Hadn't heard of spaCy; I've been struggling with Stanford's NER for a few days.
Looks great, will check it out. Thanks!

------
garysieling
I've found experimentally that doing a search on amazon for a book title, even
if you're only close, almost always turns up that exact book. However, the
product API has really restrictive terms.

I would look at the OpenLibrary dataset - you might be able to match titles in
there to comments, or use it to validate the NER output, if you don't want to
go the Amazon link regex route.

The entire dataset is available for download, or you can build a prototype
with their API - I did this to map speakers to books with
[https://www.findlectures.com](https://www.findlectures.com) (you can see it
if you hover over a name - e.g.
[https://www.findlectures.com/?p=1&speaker=-Barack%20Obama](https://www.findlectures.com/?p=1&speaker=-Barack%20Obama)).

------
shanecleveland
A pretty common structure in comments:

"name of book" by "author_first_name author_last_name"

If you could determine the most common words written before the title begins
(read, liked, loved, recommend, etc.) you could probably parse out a lot of
titles plus their authors.

------
Eridrus
You would probably be better served starting with an ISBN database and just
looking for spans that match book titles. Not sure if there's one you can
download for free, but I saw one on sale for $675.

NER will give you a lot of entities that you'll need to resolve against
something to check whether they're actually books anyway, and depending on
volume you may not find a free API tier.

~~~
jmstfv
I was thinking about querying the Google Books API to check whether NNPs are
book titles/authors; however, as you mentioned, NER produces a lot of entities.
Luckily, there is the Open Library dump[0], which has around 16 million book
titles (IIRC) with some metadata.

[0]
[https://openlibrary.org/developers/dumps](https://openlibrary.org/developers/dumps)
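As I understand the dump format, each line is tab-separated (record type, key,
revision, timestamp, then a JSON record); a sketch of pulling titles out,
with the column layout assumed from the dump docs rather than verified here:

```python
import json

def titles_from_dump(lines):
    """Extract titles from Open Library dump lines.
    Assumes each line is: type <tab> key <tab> revision <tab>
    timestamp <tab> JSON record (verify against a real file)."""
    titles = set()
    for line in lines:
        cols = line.rstrip('\n').split('\t')
        if len(cols) < 5:
            continue  # skip malformed lines
        record = json.loads(cols[4])
        title = record.get('title')
        if title:
            titles.add(title)
    return titles
```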

~~~
thakobyan
I've been using the Amazon Product API for Booknshelf (www.booknshelf.com),
passing "category => book" in the search to query only Amazon Books. The
book search by title is quite relevant. I thought this might be useful!

------
demonshalo
One problem you will run into is that X (the phrase in the comment) might be a
permutation of a book title, or even the exact title of a book mentioned in a
comment where the author did not intend it to be interpreted as such. This is
mostly due to ambiguity and information density in language, as well as how we
assign names/titles to things.

Ex. If I write a comment saying "The number of our clients went from zero to
one instantly", then "Zero to One" will match Peter Thiel's book.

If you are willing to put up with such "noise" in the output, then you don't
need to train a thing or even use ML. Just tokenize the given piece of text
and look up all token sequences of length 1 up to N in your SQL entities
database.
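That lookup can be sketched as follows; a plain Python set stands in for the
SQL entities database, and note how it happily reproduces the "zero to one"
false positive described above:

```python
def find_title_spans(text, title_set, max_n=6):
    """Slide over all token n-grams (length 1..max_n) and return the
    ones present in title_set (case-insensitive, ignoring trailing
    punctuation on the span)."""
    tokens = text.split()
    lowered = {t.lower() for t in title_set}
    hits = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            span = ' '.join(tokens[i:i + n])
            if span.lower().strip('.,!?"') in lowered:
                hits.append(span)
    return hits
```

For a real database you'd batch the candidate spans into one indexed query per
comment rather than one lookup per n-gram, but the enumeration logic is the
same.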

You can generally build this kind of entity database by aggregating datasets
from all over the web, or even use something like the Google Books
datasets: [https://books.google.co.uk/](https://books.google.co.uk/)

[https://storage.googleapis.com/books/ngrams/books/datasetsv2...](https://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
\- This is an n-gram set, but there should be one with only book titles
somewhere. Can't find it at the moment, but you can search for it yourself!

You could even (if you are brave enough) try to use Wikipedia's dumps to mine
book titles from articles.

