
Extracting Structured Data from Recipes Using Conditional Random Fields - aaronbrethorst
http://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/
======
jawns
"But there is an ever-increasing appetite from developers and designers for
finely structured data to power our digital products and at some point, we
will need to develop algorithmic solutions to help with these tasks."

One really cool area in which this sort of algorithm could be applied is
identifying location data.

Imagine an algorithm that could scan through a Times story like this one ...

[http://query.nytimes.com/gst/fullpage.html?res=9902EFDE1230F...](http://query.nytimes.com/gst/fullpage.html?res=9902EFDE1230F933A05752C0A9679D8B63)

... and extract from the text all location identifiers, then geocode them:

"Seventh Avenue and 36th Street" --> 40.7522877,-73.9897059

"Bleecker Street between Sullivan and Thompson" --> 40.728887,-73.999566

"Chrystie and Rivington" --> 40.7212581,-73.99224

I used to work for a metro daily, and I developed a script that allowed us to
geocode an address by highlighting it in our CMS and clicking a button, but
that still required an editor to highlight the correct portion of the text.

Using an algorithm to perform the task instead of an editor would open up some
incredible possibilities.

For instance, imagine a local news alerts service in which you could enter
your location and a radius, and receive alerts whenever a news item mentioned
a location within that radius. (I once developed a prototype of such a
service, but the lack of a fully automated process for identifying locations
led me to shelve it.)
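
A minimal sketch of the radius check such an alerts service would need, assuming the hard part (finding and geocoding the locations) is already done. The function names, the `STORIES` list, and the user's coordinates are hypothetical; the three story coordinates are the ones listed above, and the distance uses the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# (location text, latitude, longitude) taken from the examples above
STORIES = [
    ("Seventh Avenue and 36th Street", 40.7522877, -73.9897059),
    ("Bleecker Street between Sullivan and Thompson", 40.728887, -73.999566),
    ("Chrystie and Rivington", 40.7212581, -73.99224),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stories_within(user_lat, user_lon, radius_km, stories):
    """Return the location text of every story inside the user's radius."""
    return [
        text
        for text, lat, lon in stories
        if haversine_km(user_lat, user_lon, lat, lon) <= radius_km
    ]


if __name__ == "__main__":
    # Hypothetical subscriber in Greenwich Village with a 1.5 km radius
    for hit in stories_within(40.7306, -73.9975, 1.5, STORIES):
        print("Alert:", hit)
```

The filter itself is trivial; what it needs as input is exactly the geocoded location list that an automated extractor would have to supply.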

~~~
discardorama
> One really cool area in which this sort of algorithm could be applied is
> identifying location data.

You may want to try "PlaceSpotter" from Yahoo:
[https://developer.yahoo.com/boss/geo/](https://developer.yahoo.com/boss/geo/)

I haven't tried it myself, but did look at it for a similar idea a while back.

------
denimboy
Weird, since just yesterday I read about the LA Times doing the same
thing:

[http://datadesk.latimes.com/posts/2013/12/natural-language-p...](http://datadesk.latimes.com/posts/2013/12/natural-language-processing-in-the-kitchen/)

The LA Times article has some generic Python NLTK code. They used a MaxEnt
classifier instead of a CRF.

------
jsankey
This is really interesting to me as I've just been solving the same ingredient
parsing problem in my iOS app (Zest Recipe Manager) to implement smart
shopping lists. Although I was tempted to use a statistical approach I opted
to start with a more direct heuristic approach to see how far I could get (and
to make sure I really understood the issues before trying a more generic
solution).

The heuristic approach actually works pretty well, though with a significant
amount of effort! A lot of ambiguities can be resolved with a custom algorithm
of this kind. For shopping list support (where really the common cases matter
most) the results are excellent. But for some ambiguities I have had to hack
together solutions that a probabilistic method would probably resolve better.
And there are cases where some actual NLP is required to properly detect
extraneous descriptive phrases, etc. I'm considering adding a statistical
helper to my custom parser to take it to the next level.
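
For illustration only, here is a rough sketch of what a direct heuristic parser for ingredient lines can look like. It is not the parser from the app; the `UNITS` set, the regex, and the `parse_ingredient` function are all assumptions made for this example. It pulls off a leading quantity, then a known unit, and treats everything before the first comma as the ingredient name.

```python
import re
from fractions import Fraction

# Hypothetical, deliberately small unit vocabulary
UNITS = {
    "cup", "cups", "tablespoon", "tablespoons", "tbsp",
    "teaspoon", "teaspoons", "tsp", "ounce", "ounces", "oz",
    "pound", "pounds", "lb", "clove", "cloves",
}

# Leading quantity: mixed number ("1 1/2"), fraction ("1/2"), or decimal ("2", "0.5")
QTY_RE = re.compile(r"^\s*(\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*")


def parse_quantity(text):
    """Turn '1 1/2', '3/4' or '2.5' into an exact Fraction."""
    return sum(Fraction(part) for part in text.split())


def parse_ingredient(line):
    """Split an ingredient line into quantity, unit and name (best effort)."""
    quantity = None
    unit = None
    rest = line.strip()

    match = QTY_RE.match(rest)
    if match:
        quantity = parse_quantity(match.group(1))
        rest = rest[match.end():]

    words = rest.split()
    if words and words[0].lower().rstrip(".") in UNITS:
        unit = words[0].lower().rstrip(".")
        words = words[1:]
    if words and words[0].lower() == "of":
        words = words[1:]

    # Heuristic: text after the first comma is preparation detail, not the name
    name = " ".join(words).split(",")[0].strip()
    return {"quantity": quantity, "unit": unit, "name": name}


if __name__ == "__main__":
    for line in ["1 1/2 cups flour, sifted", "2 cloves garlic, minced", "salt to taste"]:
        print(parse_ingredient(line))
```

The last line shows the limit of this style: "salt to taste" has no comma, so the descriptive phrase ends up in the name, which is the kind of case where a statistical tagger would plausibly do better.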

------
sheraz
Funny enough, I'm also working in this space at the moment.

Right now we are training models to identify cuisines and diets in multiple
languages.

Anyone interested in this space might also want to check out Yummly
(www.yummly.com).

