
Show HN: Jam API, turn any site into a JSON api using CSS selectors - gavino
http://www.jamapi.xyz/
======
lost_my_pwd
A little word of warning/encouragement. I did something similar a long time
ago (JSONDuit), which got posted to HN by someone else.

You will probably run into a healthy mix of "that's cool" / "I did that before
you!" / "but how will it make money?". Ignore it and do your thing. If you
figure out how to monetize it, great! Even if you don't or if you have no
desire to, you will have learned and grown during the course of the project.
That is invaluable.

Have fun and screw the haters...

~~~
xaduha
I find this "shoot first (write code), ask questions later" attitude something
to be admired and a bit worrisome at the same time. Nothing against people
learning stuff, but why does it have to be promoted this way? The lack of
humility is what gets me.

Maybe I'm just jealous or something, but it rubs me the wrong way.

~~~
iamben
I think it's less about promotion and more about feedback. "Here's what I've
built, what do you think?"

We're meant to be an inclusive community of smart people. The idea is we'll
encourage the poster and offer constructive criticism (or praise).

If the post is useful to no one, it simply won't get discussed or upvoted.
When something does, it's validated as an idea, or as something of interest.

~~~
xaduha
"Just because you could, doesn't mean you should" \- that phrase should have
been applied to both writing the software and posting about it here.

Plenty of mediocre stuff gets to the frontpage and plenty of gems fall through
the cracks.

~~~
barranger
Perhaps your view of what's "mediocre" and what's a "gem" is not consistent
with the views of Hacker News readers at large.

~~~
ryanlol
Have you ever visited /newest?

------
adriancooney
This is a fantastic idea, and I'm really surprised nothing like this has
existed before; it seems like such a no-brainer. Great work.

~~~
geuis
I built something almost identical in 2011. It really doesn't have as much
utility in practice as you think initially. CSS selectors are an interesting
idea for extracting data from pages, but it's extremely fragile. You have to
either parse the page's raw html using something like jsdom, or you run it
through a headless browser like Phantom. In the first case, it completely
fails for any modern SPA (angular, react, etc). In the second case, phantom is
painfully slow and difficult to interact with, and often doesn't run/render an
SPA as a regular browser does.

You can write tests around whether your selectors are returning data, but even
simple refactors from a dev team quickly break your selector profiles multiple
times a week or month.

Just wasn't worth the hassle.
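The fragility is easy to demonstrate even without a headless browser. A minimal sketch, using only Python's standard library (the page snippets and the selector path are made up for illustration), of how a small markup refactor silently breaks an extraction rule:

```python
import xml.etree.ElementTree as ET

# Hypothetical page snapshot: the extraction rule is keyed to this structure.
page_v1 = """<div>
  <span class="title"><a href="/item/1">First story</a></span>
  <span class="title"><a href="/item/2">Second story</a></span>
</div>"""

# The same page after a small refactor by the site's dev team:
# an extra wrapper element around each link.
page_v2 = """<div>
  <span class="title"><em><a href="/item/1">First story</a></em></span>
</div>"""

def titles(html, path='.//span[@class="title"]/a'):
    """Extract link text with a structural selector (ElementTree's XPath subset)."""
    root = ET.fromstring(html)
    return [a.text for a in root.findall(path)]

print(titles(page_v1))  # ['First story', 'Second story']
print(titles(page_v2))  # [] -- the direct-child path no longer matches
```

The selector still "works" in the sense that it raises no error; it just quietly returns nothing, which is exactly why selector profiles need ongoing tests.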

~~~
mickael-kerjean
There are solutions for running an SPA in a real browser even in a headless
environment.

The trick is to emulate X11 with Xvfb and control the browser with Selenium
WebDriver.

Phantom isn't the only choice, just the one most people talk about.

As for non-JS-heavy websites, it's fairly trivial to find a library that will
parse the DOM for you; pretty much every language has one.
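For the static-page case, a minimal sketch of the "parse the DOM yourself" route, using nothing but Python's built-in `html.parser` (the class name and input snippet are illustrative):

```python
from html.parser import HTMLParser

class LinkText(HTMLParser):
    """Collect the text of every <a> tag -- no headless browser needed
    for pages that don't rely on client-side JS."""

    def __init__(self):
        super().__init__()
        self.in_a = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_a = False

    def handle_data(self, data):
        if self.in_a:
            self.links.append(data)

p = LinkText()
p.feed('<p><a href="/1">First</a> and <a href="/2">Second</a></p>')
print(p.links)  # ['First', 'Second']
```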

------
ptwt
I put this similar project[0] together a while ago. Almost the same concept,
but I skipped the json layer altogether as I just wanted a quick way of
getting nuggets of content from webpages into my terminal.

For example:

    
    
      curl https://news.ycombinator.com/news | tq -tj ".title a"
    

0. [https://github.com/plainas/tq](https://github.com/plainas/tq)

~~~
finnn
That's awesome. Like jq but for html.

------
jstanley
with curl:

    
    
      $ curl -d url=https://news.ycombinator.com/ -d json_data='{"title":"title"}' http://www.jamapi.xyz/
    

=>

    
    
      {
          "title": "Hacker News"
      }
    

Also, the Ruby example appears to post to the wrong URL?

~~~
gavino
Ah, yep, you're right, forgot to change the URL. Updated now. Thanks for
letting me know.

~~~
jstanley
And to get the HN post titles:

    
    
      curl -d url=https://news.ycombinator.com/ -d json_data='{"title":[{"elem":".title > a","value":"text"}]}' http://www.jamapi.xyz/
    

This is cool :)

EDIT:

Incidentally, you don't really need to have that "index" key inside the values
of an array, because in an array the order is preserved anyway. Unless I've
misunderstood what it means?

~~~
gavino
Regarding the "index" key, there are some JSON parsers for languages like
Swift that will rearrange your JSON. By adding the index key, you'll still be
able to sort after parsing.

Also, thanks, it's really cool to see people liking this :)

~~~
JelteF
They might rearrange keys in a JSON object, but in an array the order should
be preserved according to the spec[1]. If Swift does this (which I can't
really check) then it would be a bug.

[1] [http://www.json.org/](http://www.json.org/): An array is an ordered
collection of values. An array begins with [ (left bracket) and ends with ]
(right bracket). Values are separated by , (comma).

~~~
chriswarbo
Yes, the order of elements in an array should always be preserved. For
example, we might be expecting the first element to be a name, the second to
be a date of birth, etc. We _should_ use an object for that, but that's for
reasons of readability, extensibility, etc. rather than array semantics being
unsuitable.

Also, jq has a `--sort-keys` option which tries to make the output as
reproducible/canonical as possible. From the manual:

> The keys are sorted "alphabetically", by unicode codepoint order. This is
> not an order that makes particular sense in any particular language, but you
> can count on it being the same for any two objects with the same set of
> keys, regardless of locale settings.

It would be strange for a JSON tool to go to such lengths to normalise data,
if array order were unpredictable.
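The distinction is easy to check with plain Python: canonicalising object keys (the equivalent of jq's `--sort-keys`) never touches array order.

```python
import json

# Object keys may be reordered by tools (or by parsers in some languages)...
doc = {"zebra": 1, "apple": 2, "items": ["first", "second", "third"]}
canonical = json.dumps(doc, sort_keys=True)
print(canonical)
# {"apple": 2, "items": ["first", "second", "third"], "zebra": 1}

# ...but per the spec an array is an *ordered* collection, so round-tripping
# must preserve element order.
assert json.loads(canonical)["items"] == ["first", "second", "third"]
```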

------
chriswarbo
Very nice idea. Although scraping should always be a last resort, I could
imagine using this for semi-serious purposes, i.e. when I care enough about
the output, will be doing many requests, don't mind relaying data via a third-
party, etc.

I currently do quite a bit of scraping for my own use (generating RSS feeds
for sites, making simple commandline interfaces to automate common tasks,
etc.). I've found xidel to be pretty good for this: it starts off pretty
simple (e.g. with CSS selectors or XPath), but gets pretty gnarly for semi-
complicated things. For example, it allows templating the output, using a
language I struggle to grasp. This service seems to address that middle
ground, e.g. restricting its output to JSON, and hence making the
specification of the output much simpler (a nice JSON structure, rather than
messing around with splicing text together).

------
NicoJuicy
I'm actually wondering if it would be possible to add forms authentication to
this?

E.g. POST with some sort of CSS selectors and then a "cookie memory".

~~~
OJFord
It would be possible to, of course. But you'd surely want to host it yourself.

------
fryiee
Great! I've been trying to get my head around Scrapy, and I have little Python
experience. This seems to fit in a lot better with my skillset for the project
I'm working on.

------
denishaskin
Application Error An error occurred in the application and your page could not
be served. Please try again in a few moments.

If you are the application owner, check your logs for details.

------
OJFord
Yes, yes, yes!

I'm using Apifier at the moment, which I really like, but my biggest gripe is
the awkwardness of source (and VCS) integration. The best I've come up with is
to export the JSON config (which contains the scraper source code as a value -
yuck) and try to remember to keep re-exporting and checking it in.

Having also had to hack around the inability to parameterise the scrape url
(e.g. 'profile/$username') - which they've since added support for - I started
to wonder if I mightn't as well just use BeautifulSoup (Python HTML parser
lib) and check it in properly.

This is probably my ideal. I can keep it all in source control because it's
just an HTTP request body, and I can parameterise it because, well, it's just
an HTTP request body!

It's also open source because you're an amazing person; so if I had one little
concern left about the availability of your site I can dismiss it right away
since I could run my own on Heroku should jamapi.xyz prove unsustainable. It's
possibly a better idea to do that anyway, but I often wonder if Heroku doesn't
consider that a problem - multiple instances of the same app running on free
dynos under different accounts...

------
staticelf
I just get "invalid json" when I try to use the form on the page.

------
soheil
I think with the advent of tools like this, developers will increasingly be
thinking of ways to make it hard for someone to scrape their website into
data structures. I wonder if we're going to see the same thing that happened
to minified JS happen to HTML more and more. I know there are sites that
dynamically change CSS class names and IDs, but I think soon we will also see
div hierarchies dynamically change form without looking presentationally
different to the end user.

~~~
smadge
That would be bad for the web. DRM and the web are incompatible concepts.

------
WA
HTTPS results in 500 Internal Server Error.

Edit: Well no, it's only some sites, e.g.
[https://medium.com](https://medium.com)

~~~
diggan
If you're running the example on the website/in a browser, it's probably CORS
stopping you.

Try using a backend language or just curl and it should be fine.

~~~
WA
Well no, because my browser isn't doing the request. The underlying Node app
(the Jam API) does it.

I found it: The API responds with an HTTP 500 error if you use CSS selectors
that don't select anything or are simply invalid.

Probably makes sense to add some Exception handling right there.

~~~
gavino
I had been trying to figure out what was causing this issue; thanks for
pointing it out. I've pushed a quick fix that will now respond with whether
the JSON is invalid or a CSS selector wasn't found on the provided URL.

------
MetaMetaApplyHN
Does anyone have any information on anyone that's used HTTP as an API to
share/create metadata for any transactions, content, etc. publicly online? I
would be very curious to know about it!

Welcome feedback on my "Apply HN" on doing exactly this:
[https://news.ycombinator.com/item?id=11583348](https://news.ycombinator.com/item?id=11583348)

~~~
Mahn
Just a heads up, "Apply HN" is for built products/services, not ideas.

------
loisaidasam
Might be helpful to have the example execute inline so you can see what's
going on/experiment without having to leave the page.

------
splatcollision
Nice work, thanks for adding the Github link. I can think of lots of immediate
use for this. Consider publishing on NPM?

------
bartkappenburg
OT perhaps: I'm still looking for a solution that has a graphical UI that
allows users to point and click an element on their page and returns the
corresponding CSS-selector. SelectorGadget does this as a chrome-extension but
I'm looking for something that works without an extension.

~~~
dkopi
Chrome Developer tools. Inspect an element to get it in the elements tab.
Right click the element's HTML, copy -> copy selector.

#hnmain > tbody > tr:nth-child(3) > td > table > tbody > tr.athing >
td.default

~~~
bartkappenburg
Explain that to a small business owner (our customers) using IE or Safari. ;-)

~~~
toupeira
Why not make a screencast?

------
daw___
Wonderful idea.

What about DOM nodes generated by JavaScript? Will Jam render the page before
scraping?

~~~
gavino
It doesn't currently do that, I think it'd be an interesting challenge to try
and do that though. It's definitely possible to do.

~~~
etatoby
> interesting challenge

Understatement of the year.

You'd need to either re-implement an entire browser stack or run a headless
version of Gecko or WebKit server-side.

The former entails millions of man-hours of work. The latter opens up your
server to all sorts of exploits. Overall a really bad idea.

Besides, single page applications are the worst junk in the entire Web 2.0
cesspool. If you really need to scrape them, they usually come with their own
JSON API which you can just piggyback.

~~~
OJFord
> entails millions of man-hours of work

Overstatement of the year.

Why on Earth would the OP start from scratch? Besides, though not a solo and
OSS effort, Apifier does this; certainly without "millions" of hours having
been spent on it.

------
karlcoelho1
If anyone remembers, there was a YC company that did exactly this. It was
called Kimono Labs. I think it failed and got acquired a year ago. "Jam API"
will probably do way better because, well, open source.

------
paulmd
I've been thinking about writing some website-to-JSON scrapers myself and this
basically solves that problem (since I would have been going after CSS
selectors or xpath anyway myself). Nice job.

------
dimino
How will someone like CloudFlare stop a tool like this from scraping their
customer's sites? Just blocking the tool's IP?

~~~
brbsix
CloudFlare will make sure the browser can run JS, which in the case of this
service I assume it can't. There are ways around this, of course: headless
browsers (e.g. PhantomJS), or tools like cloudflare-scrape[0] (which uses
PyExecJS[1]). I've even used PyQt5 to render webpages for similar purposes.

Unfortunately the aforementioned tools are generally pretty slow (especially
headless browsers). Also I can't imagine it's particularly safe running such a
service.

[0]: [https://github.com/Anorov/cloudflare-scrape/](https://github.com/Anorov/cloudflare-scrape/)

[1]: [https://github.com/doloopwhile/PyExecJS](https://github.com/doloopwhile/PyExecJS)

------
smadge
I wish site publishers annotated their markup with RDFa tags, so every web
page would already be an "api".

------
nsgi
If it's going to be used for serious purposes it really needs HTTPS support,
as most APIs do these days.

------
thomasahle
What do you think would be a good syntax for enabling following links?

Say I wanted the Hacker News links + first comment?

~~~
fizx
I wrote a language that's basically a superset of this
([https://github.com/fizx/parsley/wiki](https://github.com/fizx/parsley/wiki))
back in 2008 and used it to crawl a variety of insane job posting sites.

As crawling complexity increases, pretty soon you want an actual programming
language to specify things like crawl order and cache behavior. Multi-page
behavior was very hard to describe declaratively for misbehaving sites.

Also, it's a terrible default (for security reasons) to let the web pages
you're parsing automagically initiate new requests to arbitrary urls.

Such as it is, I believe that the following works in some version of parsley,
though I doubt it's in an official release.

    
    
        {
          "articles": [ {
             "title": ".title a",
             "comment_link": "follow(.subtext a:nth-child(3) @href) .athing:nth-child(1) .default"
          } ]
        }
    

At some point, these json things might as well be as readable as regex :/

~~~
thomasahle
> Also, it's a terrible default (for security reasons) to let the web pages
> you're parsing automagically initiate new requests to arbitrary urls.

Right. We'd have to only grab the article-id, validate that it is in fact an
integer in the right range, and only then piece the URL back together and
request it.

On the other hand, maybe just checking that we stay within the domain is
enough. If the website wants to screw with us, they can send us any reply they
want to any url anyway.
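That validate-then-rebuild approach could look something like the sketch below. The URL pattern and the whitelisted host are assumptions for illustration, not part of Jam API.

```python
import re

ALLOWED_HOST = "news.ycombinator.com"

def safe_follow(href):
    """Rebuild a follow-up URL only from a validated integer item id.

    Anything that isn't a plain 'item?id=<digits>' link is rejected, so a
    scraped page can't redirect the crawler to an arbitrary URL.
    """
    m = re.fullmatch(r"item\?id=(\d+)", href)
    if m is None:
        return None
    return "https://%s/item?id=%s" % (ALLOWED_HOST, m.group(1))

print(safe_follow("item?id=11590768"))      # rebuilt, validated URL
print(safe_follow("//evil.example/steal"))  # None -- rejected
```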

------
uberneo
[http://blog.webkid.io/nodejs-scraping-libraries/](http://blog.webkid.io/nodejs-scraping-libraries/) -- good scraping options in NodeJS. My personal favourite is [https://github.com/rc0x03/node-osmosis](https://github.com/rc0x03/node-osmosis).

------
amelius
Isn't this exactly what XML (or for that matter XHTML) was supposed to do?

~~~
smadge
Or I feel like anything surrounding Linked Data, Semantic Web, RDF, RDFa,
microformats, etc.

------
joelbondurant
This!... is why we can't have nice things.

