
Extracting Structured Data from Recipes Using Conditional Random Fields (2015) - yoloswagins
https://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/
======
cs702
This is the kind of problem for which LSTM RNNs -- and more recently, fully
attention-based deep neural nets -- produce state-of-the-art results.

I wonder if the author ever tried using, say, an AWD LSTM RNN[a] or a
Transformer-like model[b] for this task.

Using an RNN or an attention model for this would eliminate the need for
feature engineering such as:

    feature_1 = 1 if x_t is capitalized and y_t equals "NAME";
                0 otherwise.

This is one of seven carefully engineered feature functions listed in the
article, and the author states that the seven are only a partial list.
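For concreteness, the feature function quoted above could be written roughly as follows in Python. This is only a sketch: the name `feature_1` and the arguments `x_t` (the token at position t) and `y_t` (its label) follow the pseudocode, and treating "capitalized" as "first character is uppercase" is an assumption, not something the article specifies.

```python
def feature_1(x_t: str, y_t: str) -> int:
    """Indicator feature: 1 if the token is capitalized and labeled "NAME", else 0.

    "Capitalized" is assumed here to mean that the first character is uppercase.
    """
    # x_t[:1] is an empty string for an empty token, and "".isupper() is False.
    return 1 if x_t[:1].isupper() and y_t == "NAME" else 0
```

A CRF would combine many such indicator functions, each with a learned weight, which is the hand-written part that an RNN or attention model would replace with learned representations.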

Moreover, using a modern RNN or attention model likely would produce better
predictions, with much better generalization.

[a] [https://arxiv.org/abs/1708.02182](https://arxiv.org/abs/1708.02182) /
[https://github.com/salesforce/awd-lstm-lm](https://github.com/salesforce/awd-lstm-lm)

[b] [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762) /
[https://github.com/tensorflow/tensor2tensor](https://github.com/tensorflow/tensor2tensor)

~~~
nl
This article is dated 2015. Can’t blame the author too much for not trying
things that would be invented 2 years later.

But yeah, it would be great follow-up work.

~~~
cs702
Ah, I didn't notice the article's date until now. Thanks for pointing that
out! Makes more sense now.

Yes, it would be great follow-up work.

------
Moru
So now you can autoconvert all cups to deciliters for usage in other
countries? :-)

I have had a recipe collection since I bought my first Atari 30+ years ago. I
always tried to stick to text files written in a consistent style, with later
processing into a database in mind, but haven't gotten around to actually doing
it yet. I made a half-hearted start on it again last week while I was sick, but
since the text files already work as they are, I stopped again.

Usually the ingredients are written "1 dl water", nothing fancy like the cases
in the article, so they would be very easy to parse with just some PHP.
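As a rough illustration of how little is needed for lines in that fixed style, here is a minimal sketch. It is in Python rather than PHP, and it assumes every line has the form "<number> <unit> <name>". The names `Ingredient`, `LINE_RE`, and `parse_line` are made up for this example.

```python
import re
from typing import NamedTuple, Optional


class Ingredient(NamedTuple):
    quantity: float
    unit: str
    name: str


# Assumed line format: "<number> <unit> <name>", e.g. "1 dl water".
# The number may use either "." or "," as the decimal separator.
LINE_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s+(\S+)\s+(.+?)\s*$")


def parse_line(line: str) -> Optional[Ingredient]:
    """Parse one ingredient line, or return None if it does not fit the format."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    quantity, unit, name = match.groups()
    return Ingredient(float(quantity.replace(",", ".")), unit, name)
```

Lines that do not start with a number, such as "a pinch of salt", return `None` here. Free-form lines like that are the kind of input the article's CRF approach is aimed at.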

~~~
Freak_NL
> So now you can autoconvert all cups to deciliters for usage in other
> countries? :-)

In my experience converting between cups/tablespoons/teaspoons and metric
units only complicates recipes. It's easier just to get a set of measuring
spoons and cups in addition to a set of scales and metric measuring cups.
Besides, even though cups may not be used as much in the metric world,
tablespoons and teaspoons and their fractions are, so most home cooks already
own a set of measuring spoons at least.

Of course units like _floz_ and _oz_ do need converting to normal volumetric
and weight units.

~~~
thiagocsf
Where did you buy your measuring cups for oven gas mark 5?

~~~
Moru
No matter if it has degree settings or not, you still have to calibrate your
oven. A gas mark 5 is just as exact as "medium heat". You need to use your
experience to calculate or just guess.

------
mark_l_watson
A three-year-old article, but worth reading: the NYT solved a pain point using
NLP and was nice about open-sourcing the work.

BTW, if you sign up properly, the NYT has an API for collecting articles by
topic - also useful for NLP research. I used this API a few years ago.

------
js2
The code discussed in the post is here:

[https://github.com/NYTimes/ingredient-phrase-tagger](https://github.com/NYTimes/ingredient-phrase-tagger)

------
magoon
The value of converting media content into structured data is underrated.

I imagine the future market is vast for media companies to offer their large
stores of content this way.

Kudos to NYT. I’m often impressed with their technical contributions.

~~~
jonnydubowsky
I agree wholeheartedly! While NYT subscription revenue has picked up in recent
years, I wonder if more newspapers and publishers might find significant value
in building more of these knowledge-based software products. Structuring the
data appears to be the key step in making these products effective. Does
anyone know of any other examples similar to this effort?

------
tclancy
>recipes that users can search, save, rate and (coming soon!) comment on

That would be a fun NLP follow-up: mark or remove recipe comments that follow
the general pattern "I substituted a for b, c for d, didn't bother with x
because I never buy it, broiled the whole thing instead of pan frying and it
was TERRIBLE!"

------
speps
Please add (2015) to the title.

~~~
sctb
Updated. Thanks!

------
chatmasta
Cool project and good write-up. This CRF reminds me a bit of definite clause
grammars (DCGs) from Prolog. It would be interesting to mix the two using some
sort of probabilistic predicate logic.

------
camkego
Would anyone in the know care to share pointers to other current
state-of-the-art extractors of structured ingredient trees, either open source
or proprietary?

------
wenc
This is the first time I've seen MathJax used on a mainstream site like
nytimes.com. Good for them.

------
azinman2
Should have a (2015)

