This really seems to be an exercise in 'structuring recipe data' rather than the ins-and-outs of scraping. Seems like a much-needed task; is there anything approaching a 'standard' for recipe data already?
Using this kind of data, I can imagine automated ordering, management of several planned recipes in conjunction, personalised timing estimations, etc. I think the whole field of cooking is ripe for some quality api-based disruption!
I tried to build a little recipe DB along these lines years ago. What I ended up finding is that as I improved in my own food preparation skills, my interest in super-rigid recipes of this form diminished.
Basically, there's just a lot of slush and room for experimentation in cooking. Think about making a stir fry— how many people would level off exact masses of different vegetables to combine in perfect proportion? Never! You toss in what looks good and that's that. As you gain confidence, this approach spreads out to cover marinades, sauces, salads, breads, baking, etc, etc. You're not looking at recipes because you intend to follow them exactly, but because you're trolling for ideas of new combinations, or new preparation methods, or whatever.
The prose is valuable, because that's where that subtlety is communicated. It's the author saying stuff like: "I know this sounds weird, but do X until Y and then gently add Z and you'll be surprised what a great flavour develops."
Authors like Mark Bittman and Deb Perelman are perfect examples of this. Sure, you might make the odd thing from them exactly as written, but that's not really why you're there. I think treating food preparation as equivalent to a Vagrantfile would really lose something along the way.
I agree that much of the value of recipes is as a source of new techniques and combinations. However, I strongly prefer precise recipes to vague ones. A recipe should be reproducible as much as possible. I can choose to make changes or combine another recipe based on my priorities. However, vagueness in a recipe adds randomness on top of your changes. If precision for a single element does not matter or varies unpredictably, that should be called out. Precision also helps me understand how different recipes can be combined into something new.
If you are just giving me a technique or idea then write the article that expresses that. The recipe should be precise but the real value comes from the prose.
That's fair, and perhaps precision could be part of the structure. The default is "exactly", but there are options for adding according to taste, judgment, mood, whatever, and some way of expressing a suggested range.
The Joy of Cooking's waffle recipe suggests 1/2c butter, but says you can drop that to 1/4c or bump it to 1c depending on how decadently crisp you want the result to be.
Note also that a lot of times the variations are to account for some other inherent variation, like adjusting the sugar in a pie depending on how sweet your fruit is.
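To make the "default is exactly, with an optional suggested range" idea concrete, here is a minimal Ruby sketch. The `Quantity` struct and its field names are hypothetical and not taken from any existing recipe standard; the figures are the waffle-butter numbers quoted above.

```ruby
# Hypothetical structure: none of these names come from an existing standard.
Quantity = Struct.new(:amount, :unit, :min, :max, :note, keyword_init: true) do
  # An amount with no range is meant to be followed exactly.
  def exact?
    min.nil? && max.nil?
  end

  # True when the cook's chosen amount is acceptable for this ingredient.
  def allows?(value)
    return value == amount if exact?
    value.between?(min, max)
  end
end

# The waffle example: 1/2 cup butter, adjustable from 1/4 cup to 1 cup.
butter = Quantity.new(amount: Rational(1, 2), unit: "cup",
                      min: Rational(1, 4), max: Rational(1, 1),
                      note: "more butter gives a crisper waffle")

puts butter.exact?                   # false
puts butter.allows?(Rational(3, 4))  # true
puts butter.allows?(2)               # false
```

The `note` field is where the "depends on how sweet your fruit is" kind of explanation could live, so the reason for the range travels with the number.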
> That's fair, and perhaps precision could be part of the structure. The default is "exactly", but there are options for adding according to taste, judgment, mood, whatever, and some way of expressing a suggested range.
Yes, that is ideal. List exactly what you did in your ingredients list. Then add information about what you change and the impact of this. Kenji López-Alt's recipe is the perfect example of this. I prefer starting from the Jacques Torres' recipe but I still go back to Kenji's article. It shows me the impact of moving any of the dials for the cookie I want.
> Note also that a lot of times the variations are to account for some other inherent variation, like adjusting the sugar in a pie depending on how sweet your fruit is.
Yes, but that should be called out.
The key thing is that a recipe should be an actual recipe. It should be precise documentation that allows you to reproduce a product as much as possible. Prose is what actually is teaching you how to cook.
Tbh I can't actually imagine the measurements in a cookbook were ever really intended to be precise, at least in the kind I've seen. Like even 4 cloves of garlic is vague; it'd have to be given in weight, but no one actually reading a recipe would prefer 30g of garlic over 1 clove.
Maybe it differs in a restaurant or business, but for home cooking, thinking of recipes as precise seems absurd to me. It's 600g... a number that was clearly rounded, and I would bet they eyeballed the amount they wanted and then measured it. There's too much imprecision in a home kitchen to pretend to be precise.
As someone relatively amateur to cooking, this is entirely news to me. Whenever something doesn't turn out right, I just assume I didn't follow the recipe exactly enough somewhere along the way, but have no idea where. Outside of "be more precise when baking", I don't know enough to know when I need to follow recipes exactly and/or how much leeway I have for fudging otherwise.
It's actually incredibly frustrating when trying to follow a cook book _because_ they're often so vague. Even things like "4 cloves of garlic" are nerve-wracking because I have no idea how big the average garlic clove is, whether mine is above or below that, how much that matters, in what ways that will affect the taste, how I should accommodate for more/less garlic later, and so on. Garlic's an easy example, but this frustration applies to almost every ingredient -- especially when they're "to taste" or "until brown" (how brown?) or "until ready" and so on.
Give me 600g, give me "stir for 2 seconds", give me detail! If I wanted to go off-book, I wouldn't be using a recipe!
Just what do you intend to do? Get a recipe that says 2 pounds chicken leg, 30g garlic, 10 sesame seeds and 4.2 cups yogurt, and then sit down and measure everything to these precise requirements? Realize you only have 4.1 cups of yogurt, and start skimming a little meat off the chicken to toss? It'd take you a day just to prepare the ingredient measurements! Drop some yogurt and you're back to the measuring table.
Home cooking is an improvised activity, primarily because it's so lenient in what it'll accept. Of course, maybe the French will tell you otherwise, but most food doesn't come with any intent to be 100% reproducible (and it'd be extremely boring if it were), or "designed". It comes out leniently tasty, and it looks good leniently.
Until brown? Until it looks like a good shade of brown. Until ready? Until it feels like it's ready. To taste? Until it's tasty.
It's imprecise, because it's an ancient activity. You do what seems like a good idea, and don't do it next time if it fails. It's a game where losing is kinda fun (unless you're "designing" food, which I have nothing to say about).
Personally, I don't find "losing" (e.g. making food that isn't good) fun at all, but rather a large waste of time. I would love a recipe that details the quantities of everything I need, and -- yes -- I would (and do) measure down other ingredients when I don't have enough of one, because the minor upfront time cost of doing so outweighs the risk of wasting the entire meal if it turns out bad, IMO.
I recognize that some people enjoy the art of cooking, but I personally don't. I cook to make food so I can survive. Perhaps I just haven't found the right cook books for my demographic.
I 100% agree with you. Perhaps this kind of structure would be better suited to baking recipes? This sort of precision (amounts, timing, even order of ingredients) can be very important when baking.
Even there, I'm not too sure. I've found bread baking to be shockingly tolerant to variation— as long as I do 1tsp dried yeast per cup of water, and about 1.5 cups of water per loaf, it turns out beautifully. And if there's not enough yeast or it's rising slowly, whatever, just give it another hour to do its thing.
Anyway, that's across white, part whole wheat, part rye, and includes all kinds of additives like bran, wheat germ, cheese, herbs, roasted garlic, even tossing in a pot of leftover porridge from breakfast. Basic bread is almost impossible to screw up.
I agree with you on bread. You have to go way off the rails to screw up bread. I was thinking more along the lines of sponges or macarons, things that are a lot more sensitive to variation.
> I think treating food preparation as equivalent to a Vagrantfile would really lose something along the way.
That's a neat idea, though, in a different direction. You could implement some kind of standard recipe API and build a business around being a broker for commercial kitchens: a customer inputs a recipe that follows a certain schema, plus an address, and the commercial kitchens respond with either "no, we can't make that - we don't have the necessary ingredients / we're cooking at capacity / etc." or "sure, here's how much time it would take and here's how much it would cost." You would lose the economies of scale found in most commercial kitchens, but customers would gain much more freedom over what they're ordering compared to traditional pick-from-a-list menus. Users could also search for highly rated recipes from around the world and see if somebody local could make them.
Mark Bittman was the second biggest influencer of the way I prepare food. The first was a college roommate, who literally improvised everything he cooked, as far as ingredients and proportions go, and it tasted amazing every time. It opened my eyes to how subjective cooking can be.
Heh, I once tried to make a DSL to cope with recipes from cookingforengineers.com, e.g. their lovely recipe cards [1]. I totally failed to make anything decent! The combining of multiple steps etc. was just painful! In the linked site above, that recipe is so straightforward; I'd love to see how it copes with [1] below.
Those cards are genius. They perfectly illustrate what you can do parallel to other things rather than having to re-read and re-interpret the written instructions continuously.
Whenever I read a recipe online I get this nagging feeling that there should be some widely used open standard for describing recipes that can do all sorts of awesome stuff (like a web crawler such as DuckDuckGo being able to answer "I have a pineapple, cream, and the usual pantry basics, what can I cook?"), but this seems to go directly against the business models of the big recipe sites, so it will not likely see much uptake.
I thought once about doing a "flavour graph" where people could go on and add what they thought were good flavours to mix together. The more people added the same flavours, the stronger the edges between nodes would get.
I'd probably have to separate it somehow by geographic region, as tastes vary, but in the end you should have a good approximation of "what goes with what".
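A rough Ruby sketch of that flavour graph, where each vote for a pairing strengthens the edge between two flavours. The class and method names are made up for illustration, and the regional split is left out; one assumed way to add it would be a separate graph per region.

```ruby
# Hypothetical sketch of a crowd-sourced flavour graph.
class FlavourGraph
  def initialize
    # Edge weights keyed by an order-independent pair of flavour names.
    @weights = Hash.new(0)
  end

  # Each vote for a pairing strengthens the edge between the two flavours.
  def vote(a, b)
    @weights[[a, b].sort] += 1
  end

  def weight(a, b)
    @weights[[a, b].sort]
  end

  # Flavours paired with the given one, strongest edge first.
  def pairings(flavour)
    @weights.select { |pair, _| pair.include?(flavour) }
            .sort_by { |_, w| -w }
            .map { |pair, _| (pair - [flavour]).first }
  end
end

graph = FlavourGraph.new
graph.vote("tomato", "basil")
graph.vote("basil", "tomato")
graph.vote("tomato", "garlic")
p graph.weight("tomato", "basil")  # 2
p graph.pairings("tomato")         # ["basil", "garlic"]
```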
This implementation looks better than the ones commonly found on German recipe sites. As soon as you start adjusting they show you "3/8 pinches of salt" or "5/6 eggs".
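One way to avoid output like "3/8 pinches of salt" is to snap scaled amounts to something measurable. This is only a sketch: the list of whole-number units and the round-to-a-quarter rule are assumptions, not what any of the sites mentioned actually do.

```ruby
# Units that only make sense as whole numbers (an assumed, incomplete list).
WHOLE_UNITS = %w[egg pinch clove].freeze

# Scale an amount by a factor and snap it to something a cook can measure.
def scale(amount, unit, factor)
  scaled = Rational(amount) * Rational(factor)
  if WHOLE_UNITS.include?(unit)
    [scaled.round, 1].max      # never show "5/6 eggs"
  else
    (scaled * 4).round / 4r    # nearest quarter of the unit
  end
end

p scale(1, "pinch", Rational(3, 8))  # 1
p scale(5, "egg", Rational(1, 6))    # 1
p scale(1, "cup", Rational(2, 3))    # (3/4)
```

Rounding an egg count obviously changes the recipe a little, so a real tool might want to flag such cases instead of silently rounding.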
This is something I'm working on. I started with a nutrition graph with endpoints for individual foods, units of measure (with mass conversion for volumetric and common units), physical changes (cutting, blending), chemical changes (cooking, fermenting), and nutrient stats.
It's designed as the backend to an NLP interface for describing your recent eating. A food selection composes those endpoints I mentioned; and a recipe is a preset group of food selections. I'm not sure if that's how it should be long-term or how recipe discovery will work, but the goal is an agent that you can talk to about your nutrition intake, food buying, and cooking.
Is your repository public and/or are you willing to share? I am not looking to make this into a business myself but as a guy who is very frustrated that he can't properly organize his diet, I really want to use something like this.
Hell, if it can be improved to be usable enough and to partially replace a personal nutritionist, one might sell access to it one day.
The plan is to put out an alpha release this month. In the meantime you might check out https://www.eatthismuch.com/.
I'm not that experienced with databases; so exposing the design might be unwise from a security perspective. I'm warm to the idea of open sourcing in the future though. I've only shared repo access with a few trusted engineers. If you think you might want to contribute, hmu ryan at terra dot farm.
Unfortunately I am way too busy lately and I feel I would only leech and not contribute, so better keep me out for now.
I have a very keen interest in the area however! What is most interesting for me recently is how exactly is such data represented in computer structures? Since I know that drawing example graphs and giving good examples would swallow huge amounts of time, I asked if the code is open-source so I could reverse-engineer the data model myself and not pester anyone. =)
I fully understand you not wanting to OSS it yet. I'll be extremely interested if you do that one day in the future.
What I'm missing the most in this context is for how many people those ingredients actually are. That's necessary if you want to scale the recipe to fewer or more people.
Yes. As someone with limited cooking skills, I don't think it's quite as simple as 'double quantities of everything to serve 4 people instead of 2'; I'm not sure if it's possible to define a simple formula for how different ingredients scale, nor whether that would need to take into account other ingredients, etc. Would be interesting to see quite how 'algorithmic' cooking can get!
> "It usually is though. Cooking times however don't tend to scale predictably (except for microwaves)."
Not sure I agree. In my very limited experience, I have found that cooking times for microwaves do not scale predictably. For example warming up 2 packs of something seems to use less than double the amount of time for 1. I have had issues in the past of doubling or tripling the cooking time for double/triple the quantity and having food overcooked.
> warming up 2 packs of something seems to use less than double the amount of time for 1
This makes sense. When you see a microwave rated as 750 W, that's assuming there is a sufficient load inside the oven that can absorb all that RF energy. Usually there's not: maybe your food can only absorb 400 W, and when you double the amount of food maybe it can absorb 600 W.
Mainly this is because the outer ~3 cm of food is what absorbs RF energy; the rest of the food is heated through thermal conduction.
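As a quick back-of-the-envelope check in Ruby, using the illustrative 400 W and 600 W figures above: heating time is energy needed divided by power absorbed. The 60,000 J per pack is a made-up number; only the ratio matters.

```ruby
# Time needed = energy required / power actually absorbed by the food.
def heating_time(energy_joules, absorbed_watts)
  energy_joules / absorbed_watts.to_f
end

energy_one_pack = 60_000  # assumed energy to heat one pack, in joules

t1 = heating_time(energy_one_pack, 400)      # one pack absorbing 400 W
t2 = heating_time(2 * energy_one_pack, 600)  # two packs absorbing 600 W

puts t1       # 150.0 seconds
puts t2       # 200.0 seconds
puts t2 / t1  # about 1.33, so well short of double the time
```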
It's kinda weird that it's not included yet. The schema.org spec has a field 'recipeYield' specifically for this purpose and it's present in the meatball example on the site. It should be quite easy for the author to add it.
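For what it's worth, reading `recipeYield` out of schema.org JSON-LD is short in Ruby. The JSON below is a made-up snippet, not the meatball example from the site, and `servings_from` is a hypothetical helper. Since `recipeYield` is often free text, the sketch just grabs the first integer it finds.

```ruby
require "json"

# A made-up JSON-LD snippet; "recipeYield" and "recipeIngredient"
# are schema.org Recipe property names, the values are invented.
json_ld = <<~JSON
  {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Meatballs",
    "recipeYield": "4 servings",
    "recipeIngredient": ["500 g minced beef", "1 egg"]
  }
JSON

# Pull the first integer out of recipeYield, or nil if there is none.
def servings_from(recipe)
  recipe["recipeYield"].to_s[/\d+/]&.to_i
end

recipe = JSON.parse(json_ld)
servings = servings_from(recipe)
factor = servings ? Rational(6, servings) : nil  # scale to 6 people

p servings  # 4
p factor    # (3/2)
```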
Well, as surprising as it might sound, there seem to be many units in the kitchen which are not so easy to convert. One example: how many milliliters does a tablespoon have? Google says 14.7868 (American tablespoon). I have read anything from 6 to 25... I mean, the problem is that the people who write those recipes use different definitions of the same unit. And that is just one example. How should you convert units which are so different between different authors?
1 US customary tablespoon (~14.8ml) is conventionally taken as 15ml, a teaspoon as 5ml, a cup as 240ml. That's roughly what you get when you buy a set of measuring spoons/cups as well; some may be a tad smaller (following the US measurements), and some are aligned with these metric values, but the difference is mostly negligible.
These are quite standardised unless you are using something like Australian tablespoons¹ (but then any American following that recipe would run into problems as well). The US customary cup and spoon measures were introduced exactly to solve the problem of measurement ambiguity in recipes! Nineteenth century cookbooks can be hard to follow because of the significant variation in the interpretation of smidgens, dashes, pinches, and (then) spoons and cups.
1: Are these really a thing? Any Australian care to chime in?
I'm Australian, a 'standard' tablespoon is 20ml, but recipes sometimes use 25. Not sure about the rest of the world. It probably doesn't actually matter too much for most recipes; the ones that need that level of accuracy tend to specify weights.
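Putting the numbers from this subthread into a lookup table gives something like the Ruby sketch below. The table layout and the `to_ml` helper are hypothetical, and falling back to the US values for units with no regional override is an assumption.

```ruby
# Conventional metric equivalents quoted in the thread, in millilitres.
ML_PER_UNIT = {
  us: { "tsp" => 5, "tbsp" => 15, "cup" => 240 },
  au: { "tbsp" => 20 }  # only the Australian tablespoon is given above
}.freeze

# Convert a volume to millilitres, falling back to the US table when
# the region has no override for that unit (an assumption).
def to_ml(amount, unit, region: :us)
  per_unit = ML_PER_UNIT.dig(region, unit) || ML_PER_UNIT[:us].fetch(unit)
  amount * per_unit
end

p to_ml(2, "tbsp")               # 30
p to_ml(2, "tbsp", region: :au)  # 40
p to_ml(0.5, "cup")              # 120.0
```

The hard part in practice is knowing which region a recipe author had in mind, which no table can tell you.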
I created http://www.recipastely.com/ to simplify recipe conversion between US and ISO units a few years back. It's not perfect but it seems to work fine for me.
For unit conversion to metric, I made a browser plugin: https://github.com/falk-hueffner/metric-cooking
It can also convert from volume to mass, e.g. annotate "3/4 cup plus 2 tablespoons packed light-brown sugar" with "[190 g]". It's a bit hacky (basically a giant regexp) but works nicely for me.
I made a recipe scraper that doesn't look at the structured (schema) data, but instead "reads" the whole page to detect the recipe. I found that there are too many recipe sites that don't use the schema rules including, of course, recipes from Facebook posts and emails. Users have used it to copy recipes into their recipe boxes from more than 70,000 different websites. There are a lot of different formats out there!
That's funny! I'm not seeing any pizza recipes now. It shows the latest shared recipes that have images. There are trends, though. One person wrote and complained that vegans were taking over the website, but the next day the Community was back to having lots of meat. The app got a bad review in the app store because this person said that Copy Me That was always showing pot recipes!
Well, not sure what it might be, but just as the previous poster got pizza all over, I'm into salmon apparently: across 2 pages, 95% of the recipes are about salmon :)
How weird, it IS all salmon! :) The front page can have a bit of a theme if someone quickly copies in a lot of similar recipes, for example, if he/she is looking for the best salmon recipes from around the web. However, the front page is limited to three recipes from any given person's recipe box. The front page might also show several copies of the same recipe if that recipe suddenly becomes very popular. However, then the recipes would all be showing the exact same salmon recipe. But this is lots of different people copying in salmon recipes from lots of different websites! The world has gone bonkers for salmon.
The front page (Community page) looks different if you're logged in because it updates more often. Maybe I need to re-think the time frame because now the website is stuck on all salmon for two hours for non-logged in users (unless they do a search).
The biggest problem with it, imho, is the lack of a proper definition of ingredients. An ingredient is just a plain string containing the unit, amount and name and sometimes an extra note. Having a quadruple instead of a string would make this standard a lot more useful.
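As a rough illustration of the quadruple idea in Ruby: the `Ingredient` struct, the tiny unit list and the regex are all hypothetical, and a real parser would need to handle far more cases (fractions, "a pinch of", ranges and so on).

```ruby
# Hypothetical quadruple: amount, unit, name, note.
Ingredient = Struct.new(:amount, :unit, :name, :note)

# Deliberately small unit list; a real parser would need far more.
UNITS = %w[g kg ml oz cup cups tbsp tsp].freeze
PATTERN = /\A(?<amount>\d+(?:\.\d+)?)\s*(?<unit>#{UNITS.join("|")})?\b\s*(?<rest>.+)\z/

# Split a plain ingredient string into the four fields; lines that do
# not start with a number are kept whole as the name.
def parse_ingredient(line)
  m = PATTERN.match(line.strip)
  return Ingredient.new(nil, nil, line.strip, nil) unless m
  name, note = m[:rest].sub(/\Aof\s+/, "").split(/\s*,\s*/, 2)
  Ingredient.new(m[:amount].to_f, m[:unit], name, note)
end

p parse_ingredient("200g of heavily salted butter").to_a
# [200.0, "g", "heavily salted butter", nil]
p parse_ingredient("2 onions, finely chopped").to_a
# [2.0, nil, "onions", "finely chopped"]
```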
I was surprised to find that bbc.co.uk/food uses the Recipe schema. Made scraping it much easier when there was some talk about canning it a couple of years back.
> I was surprised to find that bbc.co.uk/food uses the Recipe schema
Gotta say, having worked at the Beeb, this doesn't surprise me at all. Amazing what marginal-value technical itches can be scratched when commercial pressure is eased off...
Note that industrially this already exists - the EPOS company I work for has quite a sophisticated recipe, stock control and re-ordering solution. In that context, one important thing to handle is variations ("no cheese" etc)
I once hacked together some very basic code to try and do this, so that I could answer the question "what can I make with the stuff in my refrigerator".
I gave up on the project pretty quickly (now I'm really tempted to pick it up again), but you can definitely get 90% of the way there with 10% of the effort:
    ingredients = [
      '200g of heavily salted butter',
      'six bottles of beer',
      '50ml of clotted cream',
      'plain brown flour',
      '1 oz french cheese',
      '8 large eggs',
      '2kg of salted pork',
      'a pinch of salt',
      'a tablespoon of honey'
    ]

    ingredients.map { |i| IngredientParser.parse i }
    # => ["butter",
    #     "beer",
    #     "clotted cream",
    #     "flour",
    #     "cheese",
    #     "egg",
    #     "pork",
    #     "salt",
    #     "honey"]
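Once ingredients are reduced to plain names like that, the "what can I make with the stuff in my refrigerator" question becomes a subset check. A minimal sketch, with made-up recipes and a hypothetical `makeable` helper (this is not the original project's code):

```ruby
require "set"

# Recipes reduced to sets of parsed ingredient names (made-up examples).
RECIPES = {
  "omelette" => Set["egg", "butter", "salt"],
  "pancakes" => Set["egg", "flour", "butter", "milk"]
}.freeze

# A recipe is makeable when every ingredient is already in the fridge.
def makeable(recipes, fridge)
  recipes.select { |_, needed| needed.subset?(fridge) }.keys
end

fridge = Set["butter", "beer", "egg", "salt", "honey"]
p makeable(RECIPES, fridge)  # ["omelette"]
```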
I'm working on this and finding that it's a lot of work (and code) to identify the actual ingredient. Consider "all-purpose sifted flour." That's the same ingredient as "plain flour." So (so!) many ways to write the same ingredient. I'm getting there, though :)
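The simplest version of that normalisation is stripping preparation words and then mapping synonyms to one canonical name. The sketch below is only that: the word lists are tiny hand-written examples, and the `canonical` helper is hypothetical rather than anyone's production code.

```ruby
# Tiny hand-written tables, purely for illustration.
PREP_WORDS = %w[sifted chopped diced].freeze
SYNONYMS = { "all-purpose flour" => "plain flour" }.freeze

# Reduce a free-text ingredient name to one canonical form.
def canonical(name)
  words = name.downcase.split - PREP_WORDS
  cleaned = words.join(" ")
  SYNONYMS.fetch(cleaned, cleaned)
end

p canonical("all-purpose sifted flour")  # "plain flour"
p canonical("Plain flour")               # "plain flour"
```

The amount of work is in the tables, not the code, which matches the "so many ways to write the same ingredient" experience.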
MyFitnessPal has the same issue, where the quantity values are mixed between different units, and also between volume (cup) and weight (grams). So it's a pain in the neck to know how heavy a cup of pasta is.
For a lot of the basics (flour, sugar, rice, etc.) conversion formulas/tables exist, but I've found that if you use recipes from around the globe, it's much easier to just have measuring utensils for cups/spoons (US cooking volume units) alongside the standard metric volume measures and scales in your kitchen.
At least some of MFP's ("verified") raw ingredients don't seem to use those though. 100g of avocados, for example, can easily have several thousands of kcal according to them.
A cup of pasta doesn't even compute, since the mass of the pasta will depend so much on the exact shape in question.
You can't substitute a cup of dried tagliatelle (huge "balls of yarn") for a cup of risoni (rice-sized grains of pasta) and be happy after eating that meal. :)
That's the beauty of 'user-generated content'. You can scan whatever you want, and if it's not in their system you can just put the details in yourself. That includes choosing what a portion is.
Why are you triplicating the time periods? I would limit that to just the ISO 8601 value (PT15M) and let the UI choose the proper rendering of the value.
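A small Ruby sketch of that split: store only the ISO 8601 value and derive the display string from it. It handles just the hours-and-minutes subset of ISO 8601 durations, and both function names are made up.

```ruby
# Parse a time-only ISO 8601 duration such as "PT15M" or "PT1H30M"
# into a number of minutes. Other forms (days, seconds) are rejected.
def duration_minutes(iso)
  m = /\APT(?:(\d+)H)?(?:(\d+)M)?\z/.match(iso)
  raise ArgumentError, "unsupported duration: #{iso}" unless m
  m[1].to_i * 60 + m[2].to_i
end

# Rendering is left to the UI layer; this is one possible format.
def render(minutes)
  hours, mins = minutes.divmod(60)
  parts = []
  parts << "#{hours} h" if hours > 0
  parts << "#{mins} min" if mins > 0 || parts.empty?
  parts.join(" ")
end

puts render(duration_minutes("PT15M"))    # 15 min
puts render(duration_minutes("PT1H30M"))  # 1 h 30 min
```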
It looks good and seems to work with some arbitrarily picked recipes on the usual large recipe websites, although more obscure links cause some bugs¹. Is the source code on GitHub? Are you handling specific websites such as allrecipes.com with bespoke code?
Technically the project is interesting, but if you want to offer a commercial API you might run into copyright and fair use issues (as with any scraping tool). Not so much a problem for personal use, but expect angry letters from the major recipe websites for violating their terms of use (i.e., this is a threat to their business model).
I'm actually amazed it manages to scrape minor websites. Scrapers usually have to take into consideration every single website they're targeted at, and they break each time there's a major redesign (or sometimes, a minor one).
@brad0 : how did you manage that? Was thespruce.com in your targets?
I'm also curious what the approach was to being able to scrape sites in a way where the code is not coupled to a known, fixed markup structure using css selectors, xpath etc.
What are your plans for the project? Will you be open sourcing the code on GitHub?
I don't know about this scraper, but one thing that mine does (http://www.copymethat.com) is to "read" the complete page looking for certain word combinations that indicate ingredients or steps. It also considers styles and location on the page and looks for keywords that tend to start or end a recipe. It then picks what it considers to be the strongest recipe on the page. This means that it can pick up some weird things if the page doesn't actually contain a recipe; it really wants to find one!
Interesting approach, thanks for mentioning it. I guess it means you have a lot of unsuccessful results? Do you try to iterate several times on the same page to find different possible sources for a given piece of info and rank them, or is it something more like "if we're not confident enough, forget about that info"?
There are hardly any unsuccessful results. (Assuming that the page actually contains a recipe.) People have copied recipes from more than 70,000 websites into their recipe boxes. Of course, I can't check that all the millions of recipes have been accurately copied, but we do check a lot and also get terrific feedback. The parser first goes through all lines/sentences on the page and gives them a rank based on whether it seems to be an ingredient or step. Then it looks at groupings (several steps together) and then the placement of the ingredients compared to the steps.
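To give a feel for the first stage described above (ranking each line as ingredient-like or step-like), here is a toy Ruby sketch. The heuristics, word lists and thresholds are invented for illustration and are not the actual parser's rules; the grouping and placement stages are not shown.

```ruby
# Hypothetical heuristics; the real parser's rules are not described in detail.
UNIT_RE = /\b\d+\s*(g|kg|ml|cups?|tbsp|tsp|oz)\b/i
STEP_VERBS = %w[mix stir bake add heat pour whisk].freeze

# Score one line as ingredient-like and as step-like.
def score(line)
  ingredient = 0
  ingredient += 2 if line.match?(UNIT_RE)   # has a quantity with a unit
  ingredient += 1 if line.split.size <= 6   # ingredient lines tend to be short
  step = 0
  step += 2 if STEP_VERBS.include?(line.split.first.to_s.downcase)
  step += 1 if line.end_with?(".")          # steps tend to be sentences
  { ingredient: ingredient, step: step }
end

# Label a line by its stronger score, or :other when neither stands out.
def classify(line)
  s = score(line)
  return :other if s.values.max < 2
  s[:ingredient] >= s[:step] ? :ingredient : :step
end

p classify("200 g plain flour")                             # :ingredient
p classify("Stir the flour into the butter until smooth.")  # :step
p classify("My grandmother loved summers by the sea.")      # :other
```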
This is sorely needed; so many recipe blogs follow that tired format of 'long diatribe about something moderately health or family related', then the recipe. Cool!
I can't access the linked page, but I run https://www.cinc.kitchen. It tries to be the GitHub for recipes, but also has some of the most advanced scaling and parsing available. So a lot of thought has gone into how data is organized and processed.
On scrapers specifically, cinc's recipe importer is decent, but mainly relies on structured meta data.
There's a lot of room to improve these tools though. Lots of complexity and edge cases with recipes :)
Happy to answer questions people have about this stuff.
Good work! Parsing hRecipe and Schema.org Recipe entities is what we also did before for Spiceship. Unfortunately, the quality of recipes from the Internet is unspeakably low, so we had to switch almost entirely to parsing e-books. Some websites like SeriousEats are better than the others, but generally, it's not serving any purpose except aggregating the recipes Yummly-style and getting some kind of data insights from them.
Related, there's an iOS app called Mealboard which lets you plan out recipes for the week/month/etc. We've started using it all the time, mostly because it has an amazing in-app browser that scrapes web recipes and stores them in the recipe list. Really impressive tool.
While I like the idea, one of the issues I have run into way too often is that far too many sites are simply linking to the actual site for the recipe. It is getting really difficult to filter these results out as new ones crop up weekly.
Fortunately it doesn't take seven pages to find the actual recipes, but I still need to go to each one individually and possibly be buried under the ad load. The worst sites spread the recipes out over their own seven pages and still require a link to the site holding the recipe.
I use Paprika for my recipes and IIRC their API for fetching back your recipes (not public) was actually pretty nice. I have played around with creating a meal prep blog for a while and I was going to write a little service that would hit their API so I could make embeddable widgets for the recipe that pulled directly from my account (so if I updated it in the app the web would update as well). Not the same thing as this at all but I'd be interested to see what JSON format they use for recipes.
There are two open formats for recipes: the hRecipe microformat and the Schema.org Recipe entity. Aggregators like Yummly parse those. Most food blogs are based on WordPress or Typepad, both of which have recipe "editors" as plugins and produce valid hRecipe definitions.
I came up with my own schema for recipes for CookArr.com :-) but in the long run, making the recipes is more interesting for me, than scraping them off other websites :-)
While it doesn't address your concerns about ethics, recipes are explicitly not covered by US copyright law (obviously there are other factors involved here, but that's certainly the big one).
There may be ethical questions here, but there shouldn't be any legal concerns (IANAL, just a guy in the process of building a site in this space).