
Scraping Recipe Websites - benawad
https://www.benawad.com/scraping-recipe-websites/
======
selecsosi
A highly useful tool in my household for dealing with the SEO/tracking scourge
that recipe blogs have become is
[https://www.paprikaapp.com/](https://www.paprikaapp.com/).

Hoping someday to have some spare time to integrate this with
[https://grocy.info/](https://grocy.info/) and have a pipeline for recipe ->
preparation automation.

~~~
vitiell0
You might want to check out an app I built called Cooklist. It has the features
of Paprika + Grocy + Instacart + Pinterest all in one.
[https://cooklist.co](https://cooklist.co)

~~~
SpyKiIIer
works in Canada?

~~~
vitiell0
Only in the US at the moment but planning for Canada later this year

------
julianlam
I'm surprised nobody has mentioned "Recipe Filter"
[https://addons.mozilla.org/en-CA/firefox/addon/recipe-
filter...](https://addons.mozilla.org/en-CA/firefox/addon/recipe-filter/)

Cuts the fluff and puts the recipe front and center. I wouldn't be able to
find recipes online without this.

------
jedieaston
Paprika 3 (I use the iOS version, but I believe the Mac version has the same
function) has a fantastic web scraper for recipes. I've had to correct maybe
1-2 errors across 100 recipes I've brought in from a bunch of different sites.
It's super helpful to look through them in a standardized way (and you can
sort by ingredient/category) to figure out what to make.

~~~
karatestomp
I think most recipes are published using a microformat that makes this pretty
easy, and that's why Paprika (I use it too!) so rarely screws up.

~~~
reaperducer
Yep. And if Google detects that your page contains a recipe and the microdata
isn't perfect, or doesn't include all the things that Google wants so it can
show your recipe to people without them clicking through to your site, Google
sends you an e-mail through Webmaster Tools telling you to fix it, with the
implied threat that your page won't be listed if you don't allow Google to use
your work for free.

~~~
rrrrrrrrrrrryan
I'm a little torn with this. On the one hand, it's messed up that Google
forces these companies to basically hand over their data, as you put it, but
on the other hand, if they _don't_ push companies to do things like this,
single-visit webpages like lyrics and recipes inevitably become ad-infested,
SEO-driven trash.

Maybe if they used a carrot in addition to the stick, it'd feel less sleazy,
but I'm not sure what exactly that would look like.

~~~
reaperducer
_other hand, if they don't push companies to do things like this, single-
visit webpages like lyrics and recipes inevitably become ad-infested, SEO-
driven trash_

The opposite may also be true. If Google sent visitors to these sites instead
of displaying their content without compensation, the sites wouldn't be so
desperate to extract every last penny out of the reduced number of people who
click through to them.

------
kmbfjr
But how will I read about "Dakota", an avid yoga enthusiast who just happens
to be a mom, who enjoys making healthy and savory meals for her family while
blogging?

Seriously, I hope this spells an end to the Google ranking imposed nonsense
that makes the simple act of searching for a recipe so insufferable.

~~~
tmountain
It's a running joke in our house. I start off wanting to make some mashed
potatoes, and time and time again, I have to suffer through someone's life
story--the camping trip in North Dakota when Susan's husband first discovered
his love of homemade sour cream--etc. Makes me wonder if a super barebones
recipe site that literally just has recipes and absolutely no fluff would be
something people would gravitate towards.

~~~
thawaway1837
“Why doesn’t everyone who is putting information out there for free cater
to exactly my needs?”

What an utter load of bollocks. These are people who are creating something
that they enjoy doing and giving you information you apparently need for free.
I don’t understand why anyone would trash their desire to write something that
is personal and/or interesting to them about it.

And that’s setting aside that these usually make the recipes far more readable
and interesting to the vast majority of people.

~~~
scrollaway
It's a sweet world view, but the truth is that the majority of these cooking
sites are filling in content for SEO and ad purposes, and the "stories" are
fictions written by Tom, a 22 year old freelancer who isn't a yoga enthusiast
but is just trying to make ends meet at 3¢/word.

~~~
thorwasdfasdf
And, the real farce is that Google mistakenly sees all that scrolling up and
down the page, looking for the recipe as 'engagement' and "Dwell time".

~~~
mandelbrotwurst
Is that mistaken? Does no one click on ads on the page while doing this?

~~~
thorwasdfasdf
the algo thinks you're having a good time.

------
memset
Interesting! I wrote [https://plainoldrecipe.com](https://plainoldrecipe.com)
(open source!) to solve this, and inadvertently discovered many of the metadata
tags described here.

The irony is that the content is required for SEO purposes, but once you’ve
landed on the page you don’t want to see it. I wonder if there would be a way
to write SEO that only the google bot sees and hide it from humans...

~~~
phito
Your header says "plan old recipe"

~~~
yepthatsreality
Which is just dripping with irony? serendipity?

------
m_ke
Are there any legal issues with scraping recipe sites in a commercial app like
that?

I'm assuming ingredients and directions are "facts" so can't be copyrighted,
but what about the pictures?

~~~
thinkloop
Scraping is LEGAL; all search engines scrape to some degree, for example. There
is a fair use component, so you can't "scrape" 100% of a site and stick it on
your domain, but you can still scrape more than zero. In general it is leaning
more acceptable than less.

~~~
m_ke
Yeah I understand that part, my question is about showing the scraped data to
your users.

[https://www.yummly.com/](https://www.yummly.com/) used to have a paid API for
recipe search and currently still lets users search their index. Did they have
to go and get permission from each site that they index or is it fair use?

~~~
thinkloop
It looks a shade more detailed than google's recipe cards, they link back to
the original source for the instructions, I would bet they didn't get
permission, and that they count as a fair-use search engine. The law isn't
(can't be) perfectly prescriptive here, there's some line that you have to sue
about to know if it has been crossed.

------
stx
This could also be useful for websites that do not print well. I have run into
a few occasions where ads and other website elements printed with the actual
recipe. The result was a small recipe divided on several pages mostly covered
with other content. There were pictures and text formatting that I could not
copy out. Often for stuff like that I just pull the HTML and edit until it
prints well but I would rather have an easier way.

------
WrtCdEvrydy
Here's the question... why is it so difficult to do this in Android?

Seriously, AndroidDriver for Selenium was last updated 2013... and importing
it throws an HttpClient error now. Update that client and you get a class
duplication hell that is impossible to exit.

All I needed was to interact with 2-3 fields on a webpage but it's been eight
hours and now I hate my life.

~~~
openthc
Check out BrowserStack -- it's dead easy -- and even if you're not using their
platform, their docs are good for showing the Selenium/Driver usage.

------
zwieback
Cool, now the next interesting step would be to categorize recipes, maybe some
kind of clustering algorithm, to see how similar they are and whether they
have a common ancestor.

When I look at a recipe and notice some unusual proportions I usually check
against Joy of Cooking or some other standard book. I've noticed that often
everything old is new again.

------
qrv3w
This is great! It's a wonderful write-up.

I've also made something almost identical - Go libraries for scraping recipe
ingredients [1] and instructions [2]. Instead of the LCA method here, in
my version I try to find the longest sequence of highest scoring HTML tags and
those are "ingredients" or "instructions". It works very well (although I
think this one works better).

Like the article mentioned, I found that the heuristics for finding HTML
elements with ingredients turn out to be surprisingly simple - they usually
include just a number, a measurement, and a food! This simple heuristic worked
better than other sophisticated things I tried.

[1]:
[https://github.com/schollz/ingredients](https://github.com/schollz/ingredients)

[2]:
[https://github.com/schollz/instructions](https://github.com/schollz/instructions)
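
The "number, measurement, food" heuristic and the "longest sequence of highest scoring tags" idea described above can be sketched roughly as follows. This is not the code from either linked library. It is a simplified illustration that works on the text of each tag rather than on HTML, and the names `UNITS`, `ingredient_score` and `longest_ingredient_run`, the unit list, and the scoring thresholds are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately small unit vocabulary for the sketch.
UNITS = {"cup", "cups", "tbsp", "tsp", "tablespoon", "tablespoons",
         "teaspoon", "teaspoons", "g", "kg", "oz", "lb", "ml", "l",
         "pinch", "clove", "cloves"}

# A leading quantity: "2", "1/2", "1 1/2", "0.5", or a common unicode fraction.
QUANTITY = re.compile(r"^\s*(\d+(\s+\d+/\d+|/\d+|\.\d+)?|[¼½¾])")


def ingredient_score(text):
    """Score 0-3: a leading quantity, then a unit, then a short remainder (the food)."""
    match = QUANTITY.match(text)
    if not match:
        return 0
    words = text[match.end():].lower().replace(",", " ").split()
    score = 1
    if words and words[0].rstrip(".") in UNITS:
        score += 1
        words = words[1:]
    if 1 <= len(words) <= 6:  # assumed cutoff: ingredient lines are short
        score += 1
    return score


def longest_ingredient_run(lines, threshold=2):
    """Return the longest contiguous run of lines scoring at least `threshold`."""
    best, current = [], []
    for line in lines:
        if ingredient_score(line) >= threshold:
            current.append(line)
            if len(current) > len(best):
                best = list(current)
        else:
            current = []
    return best


page_text = [
    "Our camping trip in North Dakota",
    "2 lb potatoes, peeled",
    "1/2 cup milk",
    "4 tbsp butter",
    "1 pinch salt",
    "Boil the potatoes until tender.",
    "2 eggs",
]
print(longest_ingredient_run(page_text))
```

On this made-up input the run of four lines from "2 lb potatoes, peeled" to "1 pinch salt" wins over the isolated "2 eggs" line.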

------
logfromblammo
The simple truth is that the core recipes are fact-based and non-
copyrightable, and the 1000-word blogspam recipe header is both copyrightable
and garners better search result rankings.

So the business model is to take facts from the public domain, wrap it in
bullshit prose, and then SEO the bullshit to have higher ranking than the
naked source facts, for more unique visitors and ad revenue.

Comments about "providing recipes for free" are exactly as useful as
comments about "providing phone numbers for free" or "providing mailing
addresses for free" or "providing the original text of 'Little Women' for
free" or "providing the steps of the long division algorithm for free".

Obfuscating the public domain is not a valuable service. Automatically
removing the obfuscation is valuable. A "Project Gutenberg" style repository
of recipes would be recurringly donation-worthy.

------
nicbou
I just started transcribing every recipe I make. Even if you can extract all
the essential information from a recipe site, some changes are needed:

\- I need to convert recipes to metric. I am neither equipped nor inclined to
cook in freedom units.

\- A "can" or a "packet" is not a standard unit of measurement.

\- Package sizes vary between countries. I often adjust recipes to avoid
wasting food.

\- I cook by mass, not volume. I convert the units then round them.

\- Instructions are sometimes too verbose. I make them easier to follow while
my hands are busy.

\- I will make my own changes and I must write them down somewhere.

Besides, sites go down and links break. Food.com broke many of my bookmarks a
few years ago. Other sites went dark. My recipes are plain text. They are
editable, searchable, and available offline.

~~~
tincholio
I wish I had the willpower to do this consistently...

------
mark_l_watson
Hey Ben, thanks for that write up! You may not have time for this, but your
article and the intersection of food/recipes and computer science would make a
good book, at least I would read it.

I wrote [1] about 12 years ago in Clojure because for health reasons I had to
track my intake of vitamin K, then decided to track all nutrients in the USDA
nutrition database. I am working on a semantic web product (with another
semantic product in planning) but maybe the end of this year will get to
rewriting my food web app in Common Lisp and as a macOS app. I am adding a
link to your article and these comments here to my notes for that project.
Useful stuff.

[1] [http://cookingspace.com](http://cookingspace.com)

------
welanes
Neat write-up, and thanks for putting me on to jsonld.js - looks useful.

I'm building [https://simplescraper.io](https://simplescraper.io) and we're
trying to create heuristics to update CSS selectors whenever a website
changes. People become unhappy when a scrape task that ran smoothly on Monday
suddenly returns nothing on Tuesday so while it's a tough nut to crack it's
super important.

We use a combination of XPath, historical data and data type (the value may
change but the type and length often remain the same or similar) to narrow
down the options.

Of course there's more sophisticated methods using Machine learning etc. but
it's fun to try different approaches to solve this problem.

------
Cactus2018
In 2011, Google released "Google Recipe Search". With filtering based on
ingredients, cook time, and calories.

[https://www.wired.com/2011/02/google-recipe-
semantic/](https://www.wired.com/2011/02/google-recipe-semantic/)

[https://latimesblogs.latimes.com/technology/2011/02/google-d...](https://latimesblogs.latimes.com/technology/2011/02/google-
debuts-recipe-view-search-function-for-cooks.html)

------
kevindong
I personally just find recipes, make it as written from the website, and then
(if I actually like it), I'll convert it to be sane for actually following and
output into Apple Notes.

What I mean by that is most recipes call for using wwwaaayyy more intermediary
bowls/plates than actually required (e.g. if spices, chopped veggies, and
minced garlic are going into the pot at the same time, there's no point in
using three bowls) or list ingredients out of order of how you'd actually use
them.

------
peterwwillis
So far the best way I've found to search for recipes is to search in a foreign
language. Translate what you're looking for, then search and translate back to
English. There are still recipe blogs, but 5 instead of 5,000, and usually an
authentic dish, not what Michelle The Stir Fry Queen From Michigan thinks
constitutes a "Moroccan" dish because it has cinnamon and tomatoes.

Would love to see someone put together a search engine that excludes recipe
blogs and penalizes SEO.

------
jangstrom
This is pretty interesting. I wonder how the recipe parsers from MyFitnessPal
or Pinterest compare to this. Sometimes I think they do pretty good, but often
they do miss the mark. My guess is on Pinterest they only treat something as a
Recipe if it contains the metadata mentioned in the article, and do the easy
parse if so. MFP seems to try something a bit more advanced, but I've never
been super-impressed with its parsing abilities.

------
imgabe
This is great. I made a similar product at No Nonsense Recipes
[https://nononsense.recipes](https://nononsense.recipes) because I was also
tired of dealing with all the dreck on recipe sites. I did scrape some recipes
to seed the site with but haven't integrated it as a feature yet.

I did ignore the photos though, since while recipes are not subject to
copyright, photos are.

------
wantacker
Off-topic, but I just wanted to mention that Ben's been one of my favorite
'teachers' in YouTube. He has some quality content on React and JS stuff. For
those wanting to learn React (including some advanced stuff), check out his
channel! And no he didn't pay me to post this here. Hey thanks Ben - I know a
bit of React and have used it on a few projects thanks (also) to you.

------
fulldecent2
I saw all the terrible SEOd recipe websites and my first thought was: I should
make a better recipe website that is simpler and is better SEOd.

\---

FIRST EXAMPLE:

How to cook chicken on a skillet

Step 1 -- get this much chicken [picture]

Step 2 -- cook on skillet for 5 minutes

OPTIONAL -- here are seasonings you may add [pictures]

RELATED:

\- How to cook a lot of chicken on a skillet [LINK]

\- How to fry chicken breast [LINK]

\---

But then I didn't understand how any of these websites are making money so I
didn't do it.

~~~
RhodesianHunter
The reason all of these websites are so terrible with the long winded intro-
stories is precisely because they do better with SEO.

~~~
fulldecent2
Only if the page is low quality.

Leaving a low quality page after 60 seconds is way better than leaving a low
quality page after 5 seconds.

------
thinkloop
Any recommendations for a js lib that does all the "easy" scraping (microdata,
og tags, jsonld, etc)?

~~~
choward
I thought that's what this blog post was going to be about but it's just an ad
for their app. I just need the scraping functionality.

~~~
hundchenkatze
While they do end with pushing their product, I think they did a good job of
outlining how they scrape the recipes. They inform the reader about json+ld,
microdata, and how to scrape the sites that don't use those. They even link to
a JS lib that handles the parsing for you. I think calling it "just an ad" is
inaccurate.

> There are libraries like
> [https://github.com/digitalbazaar/jsonld.js/](https://github.com/digitalbazaar/jsonld.js/)
> to parse JSON-LD + Microdata for you.
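
For the "easy" case discussed in this subthread, a minimal sketch using only the Python standard library might look like the following. It assumes the page embeds a schema.org `Recipe` object in a `<script type="application/ld+json">` tag. The names `JsonLdCollector`, `iter_nodes`, `find_recipe` and the sample HTML are made up for illustration, and this is not how the article's app or jsonld.js works internally.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def iter_nodes(data):
    """Yield every dict in decoded JSON-LD, descending into lists and @graph."""
    if isinstance(data, list):
        for item in data:
            yield from iter_nodes(item)
    elif isinstance(data, dict):
        yield data
        yield from iter_nodes(data.get("@graph", []))


def find_recipe(html):
    """Return the first JSON-LD node whose @type includes "Recipe", or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        for node in iter_nodes(data):
            types = node.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Recipe" in types:
                return node
    return None


SAMPLE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@graph": [
  {"@type": "WebPage", "name": "Blog post"},
  {"@type": "Recipe", "name": "Mashed Potatoes",
   "recipeIngredient": ["2 lb potatoes", "1/2 cup milk"]}
]}
</script>
</head><body>Long story...</body></html>
"""

recipe = find_recipe(SAMPLE)
print(recipe["name"], recipe["recipeIngredient"])
```

Pages that use Microdata attributes instead of JSON-LD, or no markup at all, would need a different approach.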

------
linsomniac
A surprisingly good UX for recipes is Google Home. Ask it for a recipe, and it
will ask if you want directions or ingredients. If you ask for ingredients, it
will say them one by one, and pause between them until you ask it for the next
one. My son has used it to great effect to make pancakes.

------
aodj
Really nice! I often copy and paste recipes into text files I have locally so
this is a great alternative.

One feature request (if I may be so bold): it would be great to offer an
imperial<->metric converter. This is predominantly one of the reasons I keep
copies of recipes I find and use.

~~~
samcheng
It's really the conversion to weight (grams, from cups/tbsp/tsp/hogsheads)
that would be valuable. Cleanup is just so much easier when you stick a scale
under the mixing bowl.
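
A volume-to-weight conversion needs a per-ingredient density, which is the hard part. The sketch below assumes US customary volumes and uses rough, illustrative densities. The tables `ML_PER_UNIT` and `GRAMS_PER_ML` and the function `to_grams` are hypothetical names, and real values vary with the ingredient and how it is packed.

```python
# US customary volumes in millilitres.
ML_PER_UNIT = {"cup": 236.588, "tbsp": 14.787, "tsp": 4.929}

# Rough densities in grams per millilitre. Illustrative assumptions only.
GRAMS_PER_ML = {"water": 1.0, "flour": 0.53, "sugar": 0.85}


def to_grams(amount, unit, ingredient, round_to=5):
    """Convert a volume of an ingredient to grams, rounded to the nearest `round_to` g."""
    millilitres = amount * ML_PER_UNIT[unit]
    grams = millilitres * GRAMS_PER_ML[ingredient]
    return round_to * round(grams / round_to)


print(to_grams(2, "cup", "flour"))   # 250
print(to_grams(1, "cup", "water"))   # 235
```

Rounding to the nearest 5 g is an arbitrary choice here. An unknown unit or ingredient raises `KeyError` rather than guessing.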

~~~
benawad
this is coming soon!

------
GrantSolar
I've been working on something similar for the past couple of days, but the
trouble comes with wanting static types. There are a few projects out there
that offer either a microdata parser, or types derived from schema.org but
nothing that combines the two as yet

~~~
jeffrogers
I’ve been working on this and will have a recipe-specific solution up in a
couple weeks. See [https://rcpe.io](https://rcpe.io)

------
dsilver
[https://www.eater.com/2020/3/31/21201374/why-are-free-
online...](https://www.eater.com/2020/3/31/21201374/why-are-free-online-
recipes-so-long-stop-shaming-food-bloggers)

~~~
russellbeattie
Ha! I should totally know better, but for a second, I mixed up .com and .net
and thought, "Ben has a blog? And posted about recipes? Did he pull them into
his 8 bit computer or something??" I didn't realize my mistake until I
clicked.

------
franciscop
This is pretty interesting, I wonder if this meta could be reused for
tutorials of any kind (and not only of food, a.k.a. recipes). A tutorial
normally has some requisites, and then step by step guide of how to achieve
it, and then the final result.

------
ben_utzer
I did something similar a while ago. I still have a DB with half a
million recipes somewhere. I didn't continue it because I got stuck with the
client side and I didn't find anyone interested in helping me.

------
chirau
Is there any recipe tool out there that can do at least one of the following:

1) Scale the quantity of ingredients and cooking time as number of people to
be served increases?

2) Tell me what dishes I can make with the ingredients I have?
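
For the first item, scaling ingredient quantities is mostly fraction arithmetic. The sketch below is an illustration only, not taken from any tool mentioned in this thread, and `scale_quantity` and `format_quantity` are hypothetical names. It leaves cooking time alone, since that is unlikely to scale by simple multiplication.

```python
from fractions import Fraction


def scale_quantity(quantity, from_servings, to_servings):
    """Scale a quantity string such as "1 1/2" or "3/4"; returns a Fraction."""
    total = sum(Fraction(part) for part in quantity.split())
    return total * Fraction(to_servings, from_servings)


def format_quantity(value):
    """Format a Fraction as a mixed number, e.g. Fraction(9, 4) -> "2 1/4"."""
    whole, remainder = divmod(value, 1)
    if remainder == 0:
        return str(whole)
    if whole == 0:
        return str(remainder)
    return f"{whole} {remainder}"


# A recipe written for 4 servings, scaled to 6.
print(format_quantity(scale_quantity("1 1/2", 4, 6)))  # 2 1/4
```

Using `Fraction` instead of floats keeps results like 2 1/4 exact, so they can be shown the way recipes are usually written.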

~~~
nolroz
I enjoyed the recipe scaling abilities of Gourmet:
[https://thinkle.github.io/gourmet/](https://thinkle.github.io/gourmet/)

You can also filter and search by ingredient, but that might be somewhat
simplistic depending on what you had in mind.

------
gklitt
Pleasantly surprised to learn that most recipe sites include structured
metadata. Makes sense given the combination of a relatively straightforward
schema, and SEO incentive from Google.

------
vadansky
I've been using Tasty. Quick videos showing all the steps and how it's
supposed to look along the way. That's the only way I can accept recipes
anymore.

------
monksy
This is pretty awesome. I'm currently working on a data pipeline to
demonstrate recipe scraping with kafka streams. This is going to be a big help
in part of it.

------
brendanmcd
Googling for recipes drove me to install ad-blocker. Have to say I never
considered how google created recipe card featured snippets -- cool stuff!

------
IncRnd
Almost every comment on this page is helpful and from people's direct
experiences. Wonderful :)

Thank you everyone for all of this information!

------
Mela1998
I wish I had this when I first started cooking! I love this concept, but
wouldn't this also harm the creator's traffic???

------
ohhaimarc
This case is a perfect 'recipe' for reinforcement learning. Let me know if you
want help here.

~~~
benawad
I considered going down the ML route, but didn't know where to start. I'd love
to hear how you would approach it.

------
sum2000
Neat! I am interested in developing a REST API around it to support more
functionality, wanna collaborate?

------
saadalem
Next level : Shazam for cooking shows.

~~~
sunsetMurk
That'd be sweet for YouTube videos! I watch so many food/cooking YouTube
videos.

~~~
saadalem
That's the point : Similar to Shazam for music, you would build a mobile app
that would use voice recognition to identify the cooking show and give you the
recipe. Additionally you could find a restaurant that makes something similar
and offer to have it delivered for a fee.

------
triyambakam
Hey Ben, if you read this, thanks for your helpful and entertaining youtube
videos!

------
throwaway55554
There is markup specifically for recipes. I wonder why it isn't more often
used.

EDIT: Yes, the article mentions it, but doesn't give a clue why it isn't more
prevalent.

~~~
MandieD
Probably to make stuff like this less effective - the person who posted the
recipe only makes money from it when people actually visit the page and see
the ads.

------
hamilyon2
So, Google actually encourages an open semantic web? That is news.

~~~
greglindahl
It's been the case for quite a while. However, it's mostly useful if you have
a huge web crawl, elsewise discoverability is a bit poor.

------
SeanDav
Another tool for difficult-to-scrape sites is OCR. There are a few decent
free/opensource options available:

[https://source.opennews.org/articles/so-many-ocr-
options/](https://source.opennews.org/articles/so-many-ocr-options/)

------
partiallypro
Be careful with this, some recipes are subject to copyright law. I think you
can list ingredients of a recipe with no problem, but once you get to exact
measurements and prep it somehow switches over to falling under copyright law.
There used to be a bunch of open sourced recipe repos/databases...but almost
all of them are gone.

~~~
biggestdummy
Among other things, I am a cookbook author, so I know a fair bit about this.

Ingredient amounts are not subject to copyright protection. Any prose -
intros, descriptions, instructions - is covered by copyright the same way that
any other book would be. So, yes, this kind of activity is likely in violation
of copyright.

Let me also say that I find it a bit insulting that people who make a living
creating IP (software) would be happy to disrespect the IP of these recipe
authors. By taking the recipes from outside of the revenue source (a book, a
banner ad, a cookie, whatever), you are stealing from the author and the
publisher.

I make 25 cents on every book that is sold. So I don't actually care if you
steal from me. The money is tiny. But it is a bit insulting when people - in
my own living room, reading my copy of my book - decide that they want to take
a picture of a recipe instead of buying a copy. It devalues the hundreds of
hours and thousands of dollars that I sunk into the creation, cooking, and
photography of the book.

So here's the moral of the story. Waiters should leave good tips because they
know that waiters depend on tips. And IP creators should know better than to
steal IP.

------
papadoc
Is this post illegal as it contains information on how to commit crimes such
as copyright infringement?

~~~
jfk13
Is the instruction manual for a photocopier illegal?

