Show HN: PyIng – Ingredient parser
107 points by CokieMonster on March 28, 2022 | 31 comments
For far too long, ingredient parsers have been unavailable to the public, either due to obscene complexity:


Or because of the dreaded paywall:


Wait no longer: I introduce PyIng, an easy-to-use Python package for turning this "2 ounces of spicy melon" into this {name: melon, unit: ounces, qty: 2.0}.


Is it really necessary to have machine learning involved?

While a bit tricky, this seems to be solvable with a bunch of regular expressions/string searches. (At least that's what I'm doing on my personal recipe app.) Filtering out numbers and number words is trivial (compared to training a ML model). The number of units is not that large and thus can be filtered. The rest is the name. This method is also easier to adapt to different languages, as we don't have to create a dataset for each language.
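For what it's worth, the regex/string-search approach fits in a few lines. A minimal sketch, where the unit list and the quantity regex are made up for illustration and nothing like exhaustive:

```python
import re

# Tiny stand-in for a real unit list; a production version would be much longer.
UNITS = {"cup", "cups", "ounce", "ounces", "gram", "grams",
         "tablespoon", "tablespoons"}

def parse_ingredient(text):
    """Parse strings like '2 ounces of spicy melon' into (qty, unit, name)."""
    # Peel off a leading quantity (integer, fraction, or decimal).
    m = re.match(r"\s*(\d+(?:/\d+)?(?:\.\d+)?)\s+(.*)", text)
    qty, rest = (m.group(1), m.group(2)) if m else (None, text)
    words = rest.split()
    # Peel off a known unit, then an optional "of"; the rest is the name.
    unit = None
    if words and words[0].lower() in UNITS:
        unit = words.pop(0)
    if words and words[0].lower() == "of":
        words.pop(0)
    return qty, unit, " ".join(words)
```

Note that this keeps "spicy" as part of the name, which addresses the shopping-list concern below.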

In the example "2 ounces of spicy melon", information goes missing after parsing ("spicy"), and depending on the context it may very well be an important part of the ingredient name ("spicy chili peppers"/"mild chili peppers") that I would want to keep, e.g. for creating a shopping list.

If we're using machine learning for parsing ingredients and want to actually add value, I think the goal should be to identify the ingredients from the recipe instructions, generating the list of ingredients automatically.

But please correct me if I'm wrong, I'd love to learn the reasoning for using something as complicated as ML in this case!

Parsing natural language is hard. Writing the rules takes ages, and then they still aren't perfect. Statistical modeling has been actively researched since the 1990s to overcome this problem. The "obscene complexity" NYT parser OP mentions uses Conditional Random Fields instead of conventional symbolic rules. It seems to tag the words in the text (with custom tags, not just "noun", "verb", etc.), after which picking out the ingredients is fairly straightforward.

If you want to give it a go, grab a packrat parser in the language of your choice, pick a dozen recipes, write rules to extract the ingredients, and then apply them to other recipes. I'm fairly sure you'll be surprised.

> If you want to give it a go, grab a packrat parser in the language of your choice, pick a dozen recipes, write rules to extract the ingredients, and then apply them to other recipes. I'm fairly sure you'll be surprised.

As I didn't understand all of the words in your reply without looking them up, I'll definitely give it a try! Thanks

> conventional symbolic rules

CFG-like rules, such as those used in programming languages. The famous one that opens many introductions to syntax is S -> NP VP. That is: a Sentence can be rewritten to a Noun Phrase followed by a Verb Phrase. That covers a large part of everyday English. In fact, the previous sentence has an NP + VP, while this sentence doesn't.
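To make the S -> NP VP idea concrete, here is a toy recognizer over a hypothetical mini-grammar and lexicon (nothing to do with the NYT parser's actual rules):

```python
# Each rule rewrites a symbol to a sequence of symbols; the lexicon maps
# pre-terminal categories to words. All of this is a made-up toy grammar.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"Det": {"the", "a"}, "N": {"cook", "melon"}, "V": {"slices"}}

def parses(symbol, words):
    """Yield each number of leading words that `symbol` can consume."""
    if symbol in LEXICON:
        if words and words[0] in LEXICON[symbol]:
            yield 1
        return
    for rule in GRAMMAR[symbol]:
        consumed = {0}
        for part in rule:
            consumed = {c + n for c in consumed for n in parses(part, words[c:])}
        yield from consumed

def is_sentence(text):
    words = text.lower().split()
    return len(words) in parses("S", words)
```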

> Statistical modeling

Language is very ambiguous, and adding more rules to cover more cases makes it even worse. Collecting statistics is a way to use rules such as the above (or more complex variants) and pick out the most probable analysis.

There are other forms of statistical models. E.g., for tagging (assigning categories to words), the Markov chain (often representing n-grams) was quite popular.
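A toy sketch of the Markov-chain tagging idea: choose each word's tag from counts of (previous tag, word) pairs over a tiny hand-made training set. The tags and data here are invented for illustration:

```python
from collections import defaultdict

# Hypothetical hand-tagged training data (made up for this sketch).
TRAIN = [
    [("2", "QTY"), ("cups", "UNIT"), ("sugar", "NAME")],
    [("3", "QTY"), ("ounces", "UNIT"), ("spicy", "COMMENT"), ("melon", "NAME")],
]

# Count how often each tag follows a given (previous tag, word) pair.
counts = defaultdict(lambda: defaultdict(int))
for sent in TRAIN:
    prev = "<S>"
    for word, tag_ in sent:
        counts[(prev, word)][tag_] += 1
        prev = tag_

def tag(words):
    """Greedily pick the most frequent tag given the previous tag and word."""
    out, prev = [], "<S>"
    for w in words:
        options = counts.get((prev, w))
        best = max(options, key=options.get) if options else "NAME"  # fallback
        out.append(best)
        prev = best
    return out
```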

> Conditional Random Fields

A statistical method that can take context into account.

> custom tags, not just "noun", "verb"

One approach to adding meaning to lexical and syntactic analysis was having more semantic categories. Early attempts are features such as "living/non-living", "abstract/concrete", etc. for nouns. The NYT ingredient parser seems to have categories like quantity, unit, name (the actual ingredient), and comment (such as "fresh").

> a packrat parser in the language of your choice

Packrat is perhaps too specific a name; the more generic term is PEG parsers. They are flexible, easily programmed, and available for many programming languages; Python might even have one built in (PEP 617). You can write grammatical rules in code. Set up a notebook, and you'll be able to test your rules against input samples with the click of a button.
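For a flavor of what packrat parsing looks like when hand-rolled, here is a memoized recursive-descent sketch over a hypothetical mini-grammar (roughly: ingredient <- qty unit? "of"? name). The token regexes and the grammar are invented for this example, not any real library's API:

```python
import re
from functools import lru_cache

# Terminal rules of the toy grammar, as regexes.
TOKENS = {
    "qty":  r"\d+(?:/\d+)?",
    "unit": r"cups?|ounces?|grams?",
}

@lru_cache(maxsize=None)  # memoization is the "packrat" part
def match(rule, text, pos):
    """Return the end position if `rule` matches at `pos`, else None."""
    m = re.compile(TOKENS[rule]).match(text, pos)
    return m.end() if m else None

def parse(text):
    pos = match("qty", text, 0)
    if pos is None:
        return None
    q = text[:pos]
    pos += 1  # skip the space after the quantity
    end = match("unit", text, pos)
    unit, pos = (text[pos:end], end + 1) if end else (None, pos)
    if text.startswith("of ", pos):
        pos += 3
    return {"qty": q, "unit": unit, "name": text[pos:]}
```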

I wrote an ingredient parser that is basically just one giant regular expression: https://github.com/falk-hueffner/metric-cooking/blob/master/...

The use case is a bit different (the task includes finding ingredients mentioned in a longer text, and it would rather not parse something than parse it wrong), but it works fairly well and can even parse things like "3/4 cup plus 2 tablespoons packed light-brown sugar".

To be honest, you might well be able to handle it without ML, and you're right that it would certainly be easier to transfer to other languages. There are a couple of reasons, though:

- I wanted to see if I could; I hadn't done any NLP before :)

- Some ingredient strings can be complex or contain multiple quantities/ingredients, e.g. "3 1/2 cups icing sugar, plus 1/2 cup for dusting", "3 ripe avocados or guacamole"

- I am interested to see what can be done with the ingredient embedding, perhaps for ingredient substitution

- Hopefully the ML algorithm can learn to identify the unit and quantity based on context rather than a fixed list of ingredients

> I wanted to see if I could, I hadn't done any NLP before :)

Yep, that sounds awesome to me. After seeing your post, I actually thought that this would be a great exercise for learning ML. I'll definitely keep my list of regexes for my personal recipe app, but now I'm inclined to try the ingredient parser with ML myself.

I wrote an ingredient parser a few years ago (for a project I never finished/released); it mostly works.

The way this works is by having a static list of units ("cup", "liter", "gram", "large", etc.) and ingredients ("egg", "eggs", "courgette", etc.) and a bit of logic to combine the two and extract reasonable things from a piece of text. From what I remember it assumes the ingredient name is last, and then tries to find a unit or quantity before this (which can just be "a", or "a large", "150 gram", etc.), but it's been 4 years since I worked on this.
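A rough sketch of that heuristic in Python rather than Go, with tiny hand-made lists standing in for the real static ones: scan quantity-like and unit-like words from the front, and treat whatever remains as the ingredient name.

```python
# Illustrative stand-ins for the static lists described above.
UNITS = {"cup", "cups", "gram", "grams", "large", "small"}
NUMBERS = {"a", "an", "one", "two", "three"}

def split_ingredient(text):
    """Split '150 gram courgette' into ('150 gram', 'courgette')."""
    words = text.lower().split()
    qty_unit, i = [], 0
    # Consume leading numbers, number words, and units ("a", "a large", ...).
    while i < len(words) and (words[i] in NUMBERS or words[i] in UNITS
                              or words[i].replace(".", "").isdigit()):
        qty_unit.append(words[i])
        i += 1
    return " ".join(qty_unit), " ".join(words[i:])
```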

It works mostly well, but it's not perfect. Looking at the list of ingredients it contains things like "guinness or other stout" or "rice vermicelli noodles". Adverbs and descriptions like that are an issue; I think I extracted that list from that NYT recipe parser thing (which isn't perfect, as it's only in American English – it doesn't know about courgettes for example). I think there were some other issues as well, but I forgot. It's 128 lines of Go code, and this includes a bit of code to modify the text to mark things. It's not very complex, and I didn't spend that much time fine-tuning it. However, I agree that you probably don't need ML for this, and can probably get equal or better results without it.

I don't know exactly how well it works compared to other solutions; from what I remember I tried zestful a bit with their online demo, and it worked better than that, but I only ever tested it on some of my own recipes. I never got very good results with that NYT thing at all. It doesn't attempt to split out the quantity with the ingredient name.

The thing doesn't run at the moment so I can't compare it with the demos in here right now. I should probably finish this some day...

Very cool! I run a recipe site[1] as a side project and I definitely want to check this out. I just punted on the whole ingredient parsing problem by storing the ingredient list in a text field, but I'd definitely want to incorporate something like this in the future.

I don't see a license file, unless I'm missing it somewhere. Is it free to use?

Happy to share my database of recipes if it will help.

[1] https://nononsense.recipes

Not the author but the setup script lists it as MIT: https://github.com/whitew1994WW/PyIng/blob/master/setup.py

I like the concept, but no pictures is sort of a no go for me. Thanks for sharing, nonetheless.

Working on adding them! A bit tougher, since you can't scrape them like you can recipes. I'll start by uploading my own as I make the recipes, and hopefully I'll have users who care to upload as well.

Cool! Do you plan on capturing modifiers like "spicy" or "large" ? Without those, I imagine a lot of dishes won't be quite the same.

That would be nice, but at the moment the algo is at the mercy of the training dataset. The dataset doesn't distinguish adjectives from the ingredient, so this module doesn't :(

Very cool, and long overdue. Any plans to try and scrape some recipes into it? I'm always looking for some way of submitting whatever is left in the larder and getting a list of potential meals.

There's a great app for this! It's called Half Lemons (iOS only).

I'd like to have a go at ingredient substitution next, taking quantity into account (for instance, 2 drops of vanilla essence might be equivalent to 1/2 a vanilla pod).
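One simple way to sketch quantity-aware substitution is a lookup table keyed on (ingredient, unit) that carries a conversion ratio; the ratio below just restates the illustrative vanilla example from the comment and is not a real culinary conversion:

```python
# Hypothetical substitution table:
# (ingredient, unit) -> (replacement, replacement_unit, ratio)
SUBSTITUTIONS = {
    # 2 drops of essence ~ 1/2 a pod, per the example above (illustrative).
    ("vanilla essence", "drops"): ("vanilla pod", "pods", 0.25),
}

def substitute(name, qty, unit):
    """Return (replacement, converted_qty, unit) or None if unknown."""
    entry = SUBSTITUTIONS.get((name, unit))
    if entry is None:
        return None
    new_name, new_unit, ratio = entry
    return new_name, qty * ratio, new_unit
```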

Do you have any comparisons to other parsers in terms of accuracy?

Not yet, wanted to get something out there first. The model could do with a bit of improvement anyway.

Very nice! I've built a couple of recipe sites over the years, and ingredients are so tricky to get right (I don't think I've ever got them right once...). It seems easy at first, but then once you get into the weeds of "a pinch of salt" (what's the quantity of that?) or "onions, chopped to 3cm", sometimes I'm like "ok, let's just make ingredients an array of strings".

This is great! I would also like something like this for exercises (e.g. "squats 5x5", i.e. exercise, reps x sets, in several variations of formatting).
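For the simple "name, then reps x sets" shape mentioned above, a single regex may already get you surprisingly far (this only handles one of the "several variations of formatting", and the field names are my own guess):

```python
import re

def parse_exercise(text):
    """Parse strings like 'squats 5x5' into exercise, reps, and sets."""
    m = re.match(r"\s*([a-zA-Z ]+?)\s+(\d+)\s*[xX]\s*(\d+)\s*$", text)
    if not m:
        return None
    name, reps, sets = m.groups()
    return {"exercise": name, "reps": int(reps), "sets": int(sets)}
```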

You could consider https://traindown.com/ which is "Markdown for training data"

Previously mentioned here [https://news.ycombinator.com/item?id=25884762]

Like machine learning algorithmic art pieces, could we now make algorithmic recipes?

Would be fun to see what comes out the other side

Great question. Models for generating plausible recipes have existed in some form for about a decade now.

Arguably the first "successful" attempt at this was Chef Watson, which blew my mind when it was first released in 2014 despite its well-documented tendency to suggest all kinds of spectacularly odd combinations of flavors and ingredients, like garlic ice cream and mayonnaise-spiked Bloody Marys[1].

It's worth noting that preprocessing the textual inputs isn't entirely necessary to produce somewhat reasonable, ML-generated recipes. For example GPT-3 is capable of generating fairly interesting zero-shot recipes, despite having been trained on raw text data without any preliminary feature selection to label (e.g.) a recipe's ingredients.[2] Still not exempt from the occasional wacky, whimsical suggestion[3], but I, for one, wouldn't want my ML-generated recipes any other way.

1. https://www.google.com/amp/s/www.newyorker.com/magazine/2016...

2. https://github.com/LARC-CMU-SMU/RecipeGPT-exp

3. https://thenextweb.com/news/ai-generated-recipes-three-cours...

Garlic Ice Cream is actually a thing. Admittedly the only place I’ve seen it is at a garlic festival.

There's a store or two in downtown Gilroy, garlic capital of the world, that often make, stock, and serve it.

Gilroy's on US-101 an hour south of San Jose.

My last visit, before that last Vandenberg launch, yielded nothing.

The Stinking Rose in SF has it (https://thestinkingrose.com/sf/menu2.html).

There was IBM Chef Watson

I'm writing something more generic to do this (for anything): https://lxagi.com

Problems I didn't know people had...

