

Show HN: Gargl – Create an API for any website, without writing a line of code - jodoglevy
http://jodoglevy.com/jobloglevy/?p=5

======
brey

      if you can see or submit data using the website, it means the website does 
      have some kind of API ... For example, if you do a search in Yahoo, you can
      see the search page sent to Yahoo’s servers has the following url ...
    
      https://search.yahoo.com/search?p=search+term
    

no. no no no. this is not an API. this is about as far from an application
programming INTERFACE as it can get. an API means an agreed format, where
there's a contract (social or otherwise) to provide stability to APPLICATION
clients. there's no contract here other than 'a human types something into the
box, presses some buttons and some results appear on the website'.

/search?p=search+term is an implementation detail hidden from the humans the
site is built for. they can, and most likely will, change this at any time.
the HTML returned (and being scraped) is an implementation detail. today,
HTML. tomorrow, AJAX. next week? who knows, maybe Flash.
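
to make the point concrete, here's roughly what a 'client' of that 'API' ends
up looking like. just a sketch - the CSS selector is invented and tied entirely
to whatever the markup happens to be today, which is exactly the problem:

    # illustrative sketch only: a "client" of the Yahoo search "API" quoted above.
    # the selector below is made up; the real markup differs and can change at any
    # time, at which point this silently returns nothing.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://search.yahoo.com/search",
                        params={"p": "search term"}).text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.select("div.res h3 a"):      # hypothetical selector
        print(link.get_text(), link.get("href"))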

fine, it's a scraper builder. but don't call what it's using an API, and don't
imply it's anything more than a fragile house of cards built on the shaky
foundation of 'if you get noticed you're going to get banned or a cease and
desist'.

------
ChuckMcM
Just as a matter of record, you risk getting your IP blacklisted by using
something like this without the web site's permission. Perhaps the poster child
for web sites that go apeshit is Craigslist, but most sites respond in one way
or another. One of my favorites is the Markov-generated search results that
Bing sends back to robots.

~~~
jaytaylor
You raise interesting and valid points. Notably, there is a widely available
route around getting blocked: use Tor (it will work in many cases, though not
all).
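
(In practice that usually just means pointing your HTTP client at Tor's local
SOCKS proxy. A minimal sketch, assuming a Tor client is running locally on its
default SOCKS port, 9050, and that the requests library is installed with SOCKS
support:)

    # minimal sketch: route a request through a local Tor client.
    # assumes Tor is listening on its default SOCKS port (9050) and that
    # requests has SOCKS support, i.e. pip install "requests[socks]".
    import requests

    proxies = {
        "http": "socks5h://127.0.0.1:9050",   # socks5h = resolve DNS via Tor too
        "https": "socks5h://127.0.0.1:9050",
    }
    resp = requests.get("https://check.torproject.org/", proxies=proxies)
    print(resp.status_code)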

The most intriguing thing about Gargl, imo, is that it is a free version of
for-profit SaaS website-to-API offerings such as kimonolabs[0]. I love the
nobility of taking something which is only available as a paid service and
creating a free, open-source form of it. These kinds of projects help reveal
SaaS services which don't actually have strong value-adds, despite vendors'
claims to the contrary.

[0] http://www.kimonolabs.com/

~~~
nly
> there is naturally a rather ubiquitous route around getting blocked: use Tor

How about fuck you? Seriously, don't use Tor for scraping sites. Webmasters
can and will block Tor exit nodes if they feel the bad traffic outweighs the
good.

~~~
hawkharris
You have a valid point, but the "fuck you" is unnecessary. I hope that HN will
stay a place where people can have civil debates, drawing on evidence instead
of emotions.

~~~
sirclueless
It sounds like a very purposeful and directed "fuck you" to me. He's not being
rude or crass, he's making the concise and vehement point that if you are a
bad actor on Tor, you are harming everyone and he will hate you for it.

Enlightening debates don't come about when everyone mutes their politically
incorrect emotions and speaks in platitudes; they come about when people
respect each other and can speak simple truths as they would to their peers.
Evidence is always useful, but rational argument is equally important.

~~~
hawkharris
When I'm deciding if an online comment is civil and productive, I tend to ask:
would the commenter be willing to say this the same way in real life?

Plenty of people use the Internet's cloak of anonymity to say things that are
inconsiderate. I use this simple test to determine if they would stand behind
their remarks if accountability were in play and anonymity were not a factor.

I think it's well understood that "fuck you" is considered an offensive term
in the context of a disagreement between two strangers. My morning commute has
illustrated this point on a few occasions. :)

~~~
true_religion
In real life you can frown to communicate your extreme disapproval. You can
grit your teeth, kick the dirt, squeeze your fists, and do all sorts of non-
verbal gyrations before having to spit out a simple 'fuck you' in order to
communicate.

This isn't 'real life'. This is text-based communication.

------
digitalboss
FYI: Gargl vs. Kimono is mentioned at the bottom of the original article.
[http://jodoglevy.com/jobloglevy/?p=146](http://jodoglevy.com/jobloglevy/?p=146)

~~~
tlrobinson
I'll throw my scraper creator in the ring too:
[http://exfiltrate.org/](http://exfiltrate.org/)

~~~
walden42
Nice, I like it! Pretty easy to use. Only problem is that I couldn't download
any file =) When I try downloading a file, it just shows a green box saying it
downloaded.

~~~
tlrobinson
Thanks. Which browser are you using?

~~~
walden42
Firefox 26 on Linux Mint. It works fine in Chrome. There is no scrollbar on
the page you're viewing, either.

------
loceng
I think a better business model would be creating a service that identifies
scrapers, and then blocks them. I think one might already exist, though I
can't remember its name.

~~~
eli
I don't think either is a really great idea. I think most of the people who
would pay to block web scrapers are either being paranoid or are being scraped
by people smart and resourceful enough to get around your filters. Any serious
web scraper is going to be scripting a real browser engine, so it's going to
act just like a real visitor.
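
For example, something like Selenium drives an actual Chrome instance, so the
page is fetched and rendered exactly as it would be for a human. A rough sketch
(the URL is a placeholder):

    # rough sketch of "scripting a real browser engine" with Selenium.
    # requires the selenium package and a local Chrome install; the URL is a placeholder.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com/search?q=widgets")
        html = driver.page_source    # fully rendered HTML, JavaScript already executed
        print(len(html))
    finally:
        driver.quit()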

~~~
matznerd
There are ways to detect scrapers and other bots if you really want to and
services that do so.

~~~
eli
Adding a captcha to every page, maybe? There are services that will charge you
money for this, but that doesn't mean it works.

~~~
matznerd
Captchas don't do anything to stop bots; they just add a small additional
cost (~$1.40 per 1,000 solved). I am talking about monitoring things that 90%
of bots generally do not take precautions against, like tracking mouse
movements and other signals I won't mention here, which distinguish them from
humans.

~~~
eli
I don't believe you. At best you can obfuscate and confuse scrapers. You can't
stop them from reading a _public_ web page. (And I shudder to think what these
solutions must do to accessibility -- hope you don't have any blind readers.)

~~~
matznerd
Oh, I wasn't saying you can completely stop them from reading a page or
individual pages. But there is activity that can be detected as irregular.
Here is a true example I know of: someone wanted to scrape their competitor's
client listings. The competitor had a map with points for their customers,
keyed by random user IDs, and nowhere was the entire dataset visible. The
person just built a scraper/bot to hit every single possible ID across a range
of over 10,000 numbers. They hit a ton of empty pages, and that company should
have recognized an IP incrementally crawling their data, especially all those
empty pages... that activity should have been flagged and resulted in an IP
ban.
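
Detecting that sort of thing from access logs is not rocket science either. A
toy sketch - the log format, the URL pattern, and the thresholds are all made
up for illustration:

    # toy sketch: flag IPs that walk sequential IDs and mostly hit empty pages.
    # the log format, the /customer/<id> URL pattern, and the thresholds are all
    # invented for illustration.
    from collections import defaultdict

    hits = defaultdict(list)    # ip -> list of (requested_id, http_status)
    with open("access.log") as f:
        for line in f:
            ip, path, status = line.split()[:3]    # e.g. "1.2.3.4 /customer/10432 404"
            if path.startswith("/customer/"):
                hits[ip].append((int(path.rsplit("/", 1)[1]), int(status)))

    for ip, reqs in hits.items():
        ids = [i for i, _ in reqs]
        misses = sum(1 for _, s in reqs if s == 404)
        sequential = sum(1 for a, b in zip(ids, ids[1:]) if b == a + 1)
        if len(reqs) > 500 and misses > len(reqs) / 2 and sequential > len(reqs) * 0.8:
            print("likely enumeration bot:", ip)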

------
h3ro
I do this kind of stuff with wget, sed, and awk so far, but it's nice to see
some more thought-out alternatives.

What I like most about your competition, though, is the JS interface that gets
used for one last good thing (before the page is properly scraped and stripped
of ads and JavaScript): clicking on the content you want and deselecting
content you don't want. Subtly, with your mouse, you lead a pattern-matching
algorithm through the annoying work.

Honestly, the simplicity of this interface is even more breathtaking to me than
Gargl :P But it's also more limited: after two clicks it thinks it has
understood the pattern already, although that might not be the case.

I'd suggest integrating the idea, but making the learning process more clever:
make it possible to select more things, even when the engine thinks there
can't be any more similar things. Give that AI more things to learn from. We
want more identifiers than just counts and HTML elements: "2nd subelement of
<h1>".

There's good stuff you can do with statistics, too. Some data exists only
once, some exists only 3 times, some always exists over 10 times. That's
valuable info. Some data has many words of whitespace-separated text - oh, a
paragraph!
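
A crude sketch of that frequency idea - count how often each tag/class
combination occurs and treat the ones that repeat a lot as the likely "items"
(the file name is a placeholder):

    # crude sketch of the frequency heuristic described above: (tag, class)
    # combinations that repeat many times are probably the list items worth
    # extracting; ones that appear once are probably page chrome.
    from collections import Counter
    from html.parser import HTMLParser

    class FreqCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.counts = Counter()
        def handle_starttag(self, tag, attrs):
            cls = dict(attrs).get("class", "")
            self.counts[(tag, cls)] += 1

    parser = FreqCounter()
    parser.feed(open("page.html").read())    # page.html is a placeholder
    for (tag, cls), n in parser.counts.most_common(10):
        print(n, tag, cls)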

tl;dr: We need something that automatically generates good semantics out of
normal web sites, so that users can use a simple web UI injected into the
target web site to choose the right pattern.

------
anigbrowl
I wish all tools were presented with this level of clarity and depth. Really
great introduction, in contrast to the usual technobabble.

------
yid
IANAL et al., but unless I'm mistaken, _generating_ an API by analyzing
requests and responses would be fine (under the purview of "research
purposes"), unless you then subsequently _use_ the generated API to access the
service.

Also, it seems like authenticated sites would be difficult to scrape with
this, i.e. ones that require login and possibly some logic (like sending a
hash of request parameters) with every request.

~~~
jodoglevy
OP here.

Yes, using the generated API is the issue, not generating one.

As for authenticated sites, as long as the underlying generated module keeps
track of cookies received in responses and sends those cookies on subsequent
API calls, just like a browser would, it should work fine for "normal"
websites that use regular cookies to remember whether a user is logged in.
Gargl modules generated as PowerShell, or as JavaScript (and used in a WinJS
project), do this "cookie remembering" today. The user could also, of course,
remember the cookie themselves in their code (after getting the raw response
from the API call) and then pass that cookie into any subsequent API calls
manually.
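
Conceptually it's the same thing a cookie-aware HTTP session does in other
languages. A sketch in Python with the requests library - the endpoints here
are hypothetical, but the cookie flow is the point:

    # conceptual sketch of the "cookie remembering" described above. the login and
    # search endpoints are hypothetical; the point is that the Session object stores
    # the Set-Cookie from the login response and replays it on every later call.
    import requests

    session = requests.Session()
    session.post("https://example.com/login",
                 data={"user": "me", "password": "secret"})
    resp = session.get("https://example.com/search", params={"q": "term"})
    print(resp.status_code)    # authenticated, because the session carried the cookie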

------
jheriko
this is a clever idea - i had a similar idea many years ago in fact, but never
followed through because i strongly disagree with making it easy to e.g.
abuse google or yahoo by spamming their search engines. as much as i disagree
with keeping proprietary secrets, i agree more that people should have the
freedom of choice to do that...

in that regard it's nice to see the big warning at the top of the page about
ease of misuse (and a refreshing slap in the face - i was thinking 'pfft, some
hipster forgot common sense again' and expecting not to see anything of the
kind)

there is something off here and i can't quite put my finger on it though... as
a low-level programmer I cringe when I hear web people using API to describe
some weird little subset of APIs anyway. Here I feel almost like what this
does is take an existing 'API' (http - the internet) and refactor the
interface in highly specific ways to make it easier to use...

At any rate, it's a clever idea and nice to see such a well-thought-through
implementation - but it's also far too open to misuse imo. I wish the creator
the best of luck... hopefully no takedown requests too soon.

~~~
stringham
I wouldn't call HTTP an API. HTTP is a protocol.

~~~
jheriko
any interface that is designed to be, and can be used by, software is
technically an api...

the thing i was trying to stab at was the recent popularity of 'API' as a term
and the way it is applied...

------
benwilber0
Your description of the problem and solution is too verbose. I need bullet
points describing 1) my problems, 2) how my problems are solved by this. I'm
not going to read a full-on blog post to figure out if this is relevant to me.

------
dfgonzalez
Love these scraper template generators. I wonder why you chose Java instead
of something like PhantomJS to run the scraper.

~~~
platz
um, the whole thing is language agnostic, no?

~~~
matznerd
The generated APIs are language agnostic, but the tool that creates them is Java-based.

~~~
platz
Well, the reference generator is in Java, but take a look at the GitHub repo:
there is nothing preventing you from adding an additional generator in the
/generators directory.

All you really need to do is output the template.

------
notastartup
So the usage of the data is where the legality is concerned? If your users
scrape a site and you host the results through accessible means, you can get
sued, but not if you provide a flat CSV file?

Armchair lawyers, please advise; we need more details.

