
Show HN: An API for scraping recipe web pages - brad0
http://choppingboard.recipes/
======
oneeyedpigeon
This really seems to be an exercise in 'structuring recipe data' rather than
the ins and outs of scraping. It seems like a much-needed task; is there
anything approaching a 'standard' for recipe data already?

    
    
        "ingredients": [
            "600g pineapple, peeled, chopped"
        ]
    

This seems like a prime candidate for improvement; something like the
following would seem to be more useful:

    
    
        "ingredients": [{
            "ingredient": "pineapple",
            "quantity": {
                "value": 600
                "unit": "g",
            },
            "preparation": [ "peeled", "chopped" ]
        }]
    

Using this kind of data, I can imagine automated ordering, management of
several planned recipes in conjunction, personalised timing estimations, etc.
I think the whole field of cooking is ripe for some quality API-based
disruption!
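A naive version of that structuring step can be sketched as follows. This is an illustrative toy, not part of the API being discussed: the regex, unit list, and `parse_ingredient` name are all invented here, and real-world ingredient lines (ranges, fractions, "to taste") need far more care.

```python
import re

# Naive sketch: assumes "600g pineapple, peeled, chopped"-style lines.
# The unit list and pattern are illustrative only.
PATTERN = re.compile(
    r"^(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>g|kg|ml|l|tsp|tbsp)?\s+(?P<rest>.+)$"
)

def parse_ingredient(line):
    m = PATTERN.match(line.strip())
    if not m:
        # No leading quantity found; return the raw text as the ingredient.
        return {"ingredient": line.strip()}
    parts = [p.strip() for p in m.group("rest").split(",")]
    return {
        "ingredient": parts[0],
        "preparation": parts[1:],
        "quantity": {"value": float(m.group("value")), "unit": m.group("unit")},
    }

print(parse_ingredient("600g pineapple, peeled, chopped"))
```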

~~~
Freak_NL
Whenever I read a recipe online I get this nagging feeling that there should
be some widely used open standard for describing recipes that can do all sorts
of awesome stuff (like a search engine such as DuckDuckGo being able to answer
"I have a pineapple, cream, and the usual pantry basics, what can I cook?"),
but this seems to go directly against the business models of the big recipe
sites, so it will not likely see much uptake.

~~~
philbarr
I thought once about doing a "flavour graph" where people could go on and add
what they thought were good flavours to mix together. The more people added
the same flavours, the stronger the edges between nodes would get. I'd
probably have to separate it somehow by geographic region, as tastes vary, but
in the end you should have a good approximation of "what goes with what".
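The crowd-voted edge-weighting idea could be sketched like this. Everything here is hypothetical (the `FlavourGraph` name and API are invented for illustration); the real version would also need the per-region split mentioned above.

```python
from collections import defaultdict
from itertools import combinations

# Sketch of a "flavour graph": each submitted set of flavours that work
# together strengthens the edge between every pair in the set.
class FlavourGraph:
    def __init__(self):
        self.edges = defaultdict(int)  # frozenset({a, b}) -> vote count

    def add_pairing(self, flavours):
        for a, b in combinations(sorted(set(flavours)), 2):
            self.edges[frozenset((a, b))] += 1

    def strength(self, a, b):
        return self.edges[frozenset((a, b))]

g = FlavourGraph()
g.add_pairing(["pineapple", "coconut", "rum"])
g.add_pairing(["pineapple", "coconut"])
print(g.strength("pineapple", "coconut"))  # 2
```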

~~~
blackskad
There's a Belgian startup that matches flavors with a scientific method.
[https://www.foodpairing.com/en/home](https://www.foodpairing.com/en/home)

------
Freak_NL

        "prepTime": {
            "text": "15 minutes",
            "iso": "PT15M",
            "minutes": 15
        }
    

Why are you triplicating the time periods? I would limit that to just the ISO
8601 value (PT15M) and let the UI choose the proper rendering of the value.
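Deriving the other renderings from the ISO value is trivial for the UI to do itself. A minimal sketch, covering only the hour/minute subset of ISO 8601 durations that recipe prep times typically use (the function name is invented here):

```python
import re

# Convert an ISO 8601 duration like "PT15M" or "PT1H30M" to minutes,
# leaving the human-readable rendering to the UI layer.
def iso_duration_to_minutes(value):
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", value)
    if not m:
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes = (int(g) if g else 0 for g in m.groups())
    return hours * 60 + minutes

print(iso_duration_to_minutes("PT15M"))    # 15
print(iso_duration_to_minutes("PT1H30M"))  # 90
```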

~~~
oneeyedpigeon
Agreed; also more useful for internationalisation to keep textual descriptions
out of the data, wherever possible.

------
Freak_NL
It looks good and seems to work with some arbitrarily picked recipes on the
usual large recipe websites, although more obscure links cause some bugs¹. Is
the source code on GitHub? Are you handling specific websites such as
allrecipes.com with bespoke code?

Technically the project is interesting, but if you want to offer a commercial
API you might run into copyright and fair use issues (as with any scraping
tool). Not so much a problem for personal use, but expect angry letters from
the major recipe websites for violating their terms of use (i.e., this is a
threat to their business model).

1: Try this link: [https://www.thespruce.com/peking-duck-
recipe-694920](https://www.thespruce.com/peking-duck-recipe-694920) . You'll
see a lot of garbage HTML and XML entities in the JSON that can be filtered
and replaced fairly easily.

~~~
oelmekki
I'm actually amazed it manages to scrape minor websites. Scrapers usually have
to take into consideration every single website they target, and they break
each time there's a major redesign (or sometimes a minor one).

@brad0 : how did you manage that? Was thespruce.com in your targets?

~~~
tinebak
I don't know about this scraper, but one thing that mine does
([http://www.copymethat.com](http://www.copymethat.com)) is to "read" the
complete page looking for certain word combinations that indicate ingredients
or steps. It also considers styles and location on the page and looks for
keywords that tend to start or end a recipe. It then picks what it considers
to be the strongest recipe on the page. This means that it can pick up some
weird things if the page doesn't actually contain a recipe; it really wants to
find one!

~~~
oelmekki
Interesting approach, thanks for mentioning it. I guess it means you have a
lot of unsuccessful results? Do you try to iterate several times over the same
page to find different possible sources for a given piece of info and rank
them, or is it something more like "if we're not confident enough, forget
about that info"?

~~~
tinebak
There are hardly any unsuccessful results. (Assuming that the page actually
contains a recipe.) People have copied recipes from more than 70,000 websites
into their recipe boxes. Of course, I can't check that all the millions of
recipes have been accurately copied, but we do check a lot and also get
terrific feedback. The parser first goes through all lines/sentences on the
page and gives them a rank based on whether it seems to be an ingredient or
step. Then it looks at groupings (several steps together) and then the
placement of the ingredients compared to the steps.
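The line-scoring idea described above might look roughly like this. To be clear, this is not copymethat's actual code: the cue lists, weights, and `score_line` name are all invented for illustration, and the real parser also uses styling, page position, and grouping signals.

```python
import re

# Hypothetical cue lists; a real heuristic parser would use far richer ones.
UNITS = {"g", "kg", "ml", "cup", "cups", "tsp", "tbsp", "oz", "lb"}
STEP_VERBS = {"preheat", "mix", "stir", "bake", "chop", "simmer", "whisk"}

def score_line(line):
    """Score one line on how much it looks like an ingredient vs. a step."""
    words = re.findall(r"[a-z]+", line.lower())
    ingredient = 0.0
    step = 0.0
    if re.search(r"\d", line):
        ingredient += 1  # quantities suggest an ingredient line
    ingredient += sum(w in UNITS for w in words)
    step += sum(w in STEP_VERBS for w in words)
    if len(words) > 12:
        step += 1  # long sentences read more like instructions
    return {"ingredient": ingredient, "step": step}

print(score_line("600 g pineapple, peeled"))
print(score_line("Preheat the oven and stir the batter until smooth"))
```

A full parser would then look at runs of high-scoring lines (several steps together) and their placement relative to the ingredients, as described above.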

------
polm23
Obligatory mention of the NYTimes using a CRF to get structured data from
recipes:

[https://open.blogs.nytimes.com/2015/04/09/extracting-
structu...](https://open.blogs.nytimes.com/2015/04/09/extracting-structured-
data-from-recipes-using-conditional-random-fields/)

------
borne0
This is sorely needed; so many recipe blogs follow that tired format of a long
diatribe about something moderately health- or family-related, then the
recipe. Cool!

------
keithasaurus
I can't access the linked page, but I run
[https://www.cinc.kitchen](https://www.cinc.kitchen). It tries to be the
GitHub for recipes, but also has some of the most advanced scaling and parsing
available. So a lot of thought has gone into how data is organized and
processed.

On scrapers specifically, cinc's recipe importer is decent, but it mainly
relies on structured metadata.

There's a lot of room to improve these tools though. Lots of complexity and
edge cases with recipes :)

Happy to answer questions people have about this stuff.

------
mstaoru
Good work! Parsing hRecipe and Schema.org Recipe entities is what we also did
before for Spiceship. Unfortunately, the quality of recipes from the Internet
is unspeakably low, so we had to switch almost entirely to parsing e-books.
Some websites like SeriousEats are better than others, but generally it
doesn't serve any purpose except aggregating the recipes Yummly-style and
getting some kind of data insight from them.

------
moepstar
Not sure if this is supposed to work, but the German recipe site
chefkoch.de[1] gave me an "Unexpected token in JSON at position 1611" error.

[1] [http://www.chefkoch.de/rezepte/565001154855998/Big-Kahuna-
Bu...](http://www.chefkoch.de/rezepte/565001154855998/Big-Kahuna-Burger.html)

------
dbot
Related, there's an iOS app called Mealboard which lets you plan out recipes
for the week/month/etc. We've started using it all the time, mostly because it
has an amazing in-app browser that scrapes web recipes and stores them in the
recipe list. Really impressive tool.

------
Shivetya
While I like the idea, one of the issues I run into way too often is that far
too many sites simply link to the actual site hosting the recipe. It is
getting really difficult to filter these results out as new ones crop up
weekly.

Example: [https://yurielkaim.com/7-green-detox-juice-
recipes/](https://yurielkaim.com/7-green-detox-juice-recipes/)

Fortunately it doesn't take seven pages to find the actual recipes, but I
still need to go to each one individually and possibly be buried under the ad
load. The worst sites spread the recipes out over seven pages of their own and
still require a link to the site holding the recipe.

------
joshstrange
I use Paprika for my recipes and IIRC their API for fetching back your recipes
(not public) was actually pretty nice. I have played around with creating a
meal prep blog for a while and I was going to write a little service that
would hit their API so I could make embeddable widgets for the recipe that
pulled directly from my account (so if I updated it in the app the web would
update as well). Not the same thing as this at all but I'd be interested to
see what JSON format they use for recipes.

------
bomdo
Impressively, it also works for languages other than English (albeit with some
formatting trouble here and there).

Is there a behind-the-scenes somewhere? Is this regex magic alone?

~~~
mstaoru
There are two open formats for recipes: hRecipe microformat, and Schema.org
Recipe entity. Aggregators like Yummly parse those. Most food blogs are based
on Wordpress or Typepad, both of which have recipe "editor" plugins that
produce valid hRecipe definitions.
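Schema.org Recipe data is also commonly embedded as JSON-LD, which is straightforward to pull out. A minimal sketch under stated assumptions: the regex-based `<script>` extraction is a simplification (a real scraper should use an HTML parser and also handle `@graph` wrappers and the hRecipe microformat), and the function name and sample page are invented here.

```python
import json
import re

# Grab <script type="application/ld+json"> blocks and keep the ones
# typed as a Schema.org Recipe.
LD_JSON = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)

def extract_recipes(page_html):
    recipes = []
    for block in LD_JSON.findall(page_html):
        try:
            data = json.loads(block)
        except ValueError:
            continue  # skip malformed JSON-LD blocks
        items = data if isinstance(data, list) else [data]
        recipes.extend(
            d for d in items
            if isinstance(d, dict) and d.get("@type") == "Recipe"
        )
    return recipes

page = '''<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Big Kahuna Burger", "recipeYield": "4"}
</script></head></html>'''
print(extract_recipes(page)[0]["name"])  # Big Kahuna Burger
```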

------
pawelkomarnicki
I came up with my own schema for recipes for CookArr.com :-) but in the long
run, making the recipes is more interesting for me than scraping them off
other websites :-)

~~~
qrv3w
This is really neat, very nice layout and search!

------
zackify
The first recipe I tried gave me an unexpected JSON error. It seems to only
work with a few sites. Cool idea though.

------
staticelf
Very impressed; it even works fine on websites in Swedish.

------
mapster
Use case: Alexa walking me through preparing a meal for four, learning my
tastes and diet, and making recommendations when asked.

------
Cosmopolitan
I've been using Chrome's scraper extension. This looks like it could be fun to
try out though.

------
Dowwie
This falls under the category of unethical scraping.

~~~
JshWright
While it doesn't address your concerns about ethics, recipes are explicitly
not covered by US copyright law (obviously there are other factors involved
here, but that's certainly the big one).

There may be ethical questions here, but there shouldn't be any legal concerns
(IANAL, just a guy in the process of building a site in this space).

~~~
tinebak
Recipes aren't, but images always are. Collections of recipes (such as a
cookbook) are also protected.

