

An open source API for web scraping - owainlewis
https://github.com/owainlewis/falkor

======
owainlewis
An example showing how to grab all the stories from the Hacker News homepage

[https://falkor-api.herokuapp.com/api/query?url=https://news....](https://falkor-api.herokuapp.com/api/query?url=https://news.ycombinator.com/news&q=td.title%20a)
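
For anyone who wants to build these query URLs from a script rather than by hand, here is a minimal sketch. The endpoint and the `url` and `q` parameter names are taken from the link above; the helper name `build_query_url` is made up for illustration, and percent-encoding the nested URL is an assumption about what the server accepts (the hand-written link passes it unencoded).

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the example link in this thread.
FALKOR_ENDPOINT = "https://falkor-api.herokuapp.com/api/query"


def build_query_url(page_url, selector):
    """Build a query URL for a target page and a CSS selector.

    Hypothetical helper: only the `url` and `q` parameters seen in the
    thread are used. Both values are percent-encoded so that characters
    such as '&' or '#' in the target URL cannot break the outer query.
    """
    params = {"url": page_url, "q": selector}
    # quote_via=quote encodes a space as %20 rather than '+'.
    return FALKOR_ENDPOINT + "?" + urlencode(params, quote_via=quote)


print(build_query_url("https://news.ycombinator.com/news", "td.title a"))
```

The selector `td.title a` is the same one used in the link above, with the space encoded as `%20`.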

------
_jomo
Title should probably contain 'Show HN:' ?

Very interesting though. Just tried scraping Twitter and it works great:
[https://falkor-api.herokuapp.com/api/query?url=https://twitt...](https://falkor-api.herokuapp.com/api/query?url=https://twitter.com/shit_hn_says&q=.tweet-text)

Edit: works great as long as there are no quotes, hashtags, or links in the
tweets. Is it possible to include sub-elements?

So basically this is a DOM API in JSON. Simple, but I like it.

Any plans to add JSONP support?
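
To make the JSONP request concrete: JSONP just wraps the JSON body in a call to a function named by the client, so a `<script>` tag on another origin can load it. A rough sketch of the server-side wrapping follows. The function name `to_jsonp`, the idea of a `callback` parameter, and the sample payload are all assumptions for illustration, not anything Falkor exposes today.

```python
import json
import re

# Accept only plain (optionally dotted) JavaScript identifiers as callbacks,
# so a caller cannot inject arbitrary script through the callback name.
_CALLBACK_RE = re.compile(r"^[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*$")


def to_jsonp(payload, callback):
    """Wrap a JSON-serialisable payload in a JSONP callback invocation."""
    if not _CALLBACK_RE.match(callback):
        raise ValueError("invalid JSONP callback name: %r" % callback)
    return "%s(%s);" % (callback, json.dumps(payload))


# Hypothetical payload; the real response shape is not shown in this thread.
print(to_jsonp({"results": ["first story", "second story"]}, "handleStories"))
```

The response would be served with a JavaScript content type rather than `application/json`.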

~~~
owainlewis
Hey. Thanks. Yeah, I will add a ton of features over the next few days. JSONP
should be an easy one. Feel free to add an issue on GitHub and I'll get it
done for you.

I only started hacking around on the idea the other day, so it's early stages.
I want to add filters so you can say "grab me only the text" or "grab me just
the class names". Obviously another step would be to grab multiple elements in
one request.

------
getriver
A better error message would be helpful. For example, I tried:
[https://falkor-api.herokuapp.com/api/query?url=https://kodin...](https://falkor-api.herokuapp.com/api/query?url=https://koding.com/Activity/Public/Liked&q=a)
and all I got was "Request failed".

~~~
owainlewis
That's a good point. I pretty much wrote this in an evening or two, so I
haven't had time to refine it much. But yeah, error messages will definitely
be improved. The failure comes from the way URLs are handled in the underlying
web app. It will be an easy fix.

------
Jake232
Cool idea. This could easily be extended to support something like a proxy
pool; that way you can rate limit / rotate proxies for X domain globally at
this server level. That way it's across all your projects, rather than having
to do it on a per project basis.
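
A rough sketch of what that server-level piece could look like, assuming simple round-robin proxy rotation and a fixed minimum delay per domain. The class name, the proxy names, and the interval are all made up for illustration; nothing here is part of Falkor.

```python
import itertools
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Rotate proxies and enforce a minimum delay between hits on a domain."""

    def __init__(self, proxies, min_interval, clock=time.monotonic):
        self._proxies = itertools.cycle(proxies)  # round-robin rotation
        self._min_interval = min_interval         # seconds between hits per domain
        self._clock = clock                       # injectable so it can be tested
        self._next_free = {}                      # domain -> earliest allowed time

    def acquire(self, url):
        """Return (proxy, wait_seconds) for the next request to `url`.

        The caller is expected to sleep for `wait_seconds` before sending.
        """
        domain = urlsplit(url).hostname
        now = self._clock()
        wait = max(0.0, self._next_free.get(domain, now) - now)
        # Reserve the slot: the next request to this domain must wait
        # until min_interval after this one is scheduled to go out.
        self._next_free[domain] = now + wait + self._min_interval
        return next(self._proxies), wait


throttle = DomainThrottle(["proxy-a", "proxy-b"], min_interval=2.0)
proxy, wait = throttle.acquire("https://news.ycombinator.com/news")
```

Because the state lives in one place, every project calling through the same server shares the same per-domain limits.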

Adding XPath support alongside CSS selectors would be a good addition.

~~~
owainlewis
Will definitely do something with caching and rate limiting when I get some
time. These queries are quite expensive, so those areas definitely need a bit
of work.

------
owainlewis
An example query that extracts all the images from the Digg.com homepage.

[https://falkor-api.herokuapp.com/api/query?url=http://digg.c...](https://falkor-api.herokuapp.com/api/query?url=http://digg.com&q=img\[src\])
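
For comparison, here is roughly what the `img[src]` selector is asking for, sketched locally with Python's standard `html.parser`. This is only an illustration of the selector's meaning (every `img` element that has a `src` attribute), not how the service is implemented; the class name and the sample markup are made up.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src of every <img> that has a src attribute."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            attributes = dict(attrs)
            if "src" in attributes:
                # A bare `src` with no value is reported as None.
                self.sources.append(attributes["src"] or "")


collector = ImageSrcCollector()
collector.feed('<p><img src="a.png"><img alt="no source"><img src="b.jpg"/></p>')
print(collector.sources)
```

The second image is skipped because it has no `src` attribute, which is the same filtering the attribute selector performs.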

------
curiously
Pretty interesting. Last year I wrote a web scraping API where you could paste
a URL into your browser and download the results, but I took it down to work
on another project. You can take a look at what a URL could look like:

[https://web.archive.org/web/20140420162639/http://scrape.ly/](https://web.archive.org/web/20140420162639/http://scrape.ly/)

For example, if you wanted the profiles of the authors of today's stories:

    http://scrape.ly/s/{http://news.combination.com}
    {'ueoma87'}*{'next':'Next Page'}{'karma':'331',
    'username':'ueoma87'}

That would have returned the profile of each story's author for today,
yesterday, and so on.

~~~
owainlewis
Thanks. This looks really interesting. I may well borrow some ideas ; )

