

Show HN: ScreenSlicer – Automatic, zero-config web scraping - logn
https://github.com/MachinePublishers/ScreenSlicer

======
kazinator
_Using neural nets and tuned heuristics, ScreenSlicer is able to intelligently
find a search box, enter a query, extract the results, and page forward in the
results._

This sounds like it might be usable for making a _universal_ program to get
through the idiotic, annoying, hated-by-everyone web-based sign-on screens
required by many free Wi-Fi hot spots.

As in, something that doesn't have a rigid template database of what UI it can
deal with so the users have to beg "please support such and such hot spot in
the next release".

~~~
logn
Specifically, the neural nets are used in finding the parent node of the
result set (or one of the parents). Heuristics (basically just generic
strategies) are used for finding search boxes. There's also a similar function
for finding authentication forms. Most of the functions in this project are
static so it should be easy to use isolated components as needed.
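
To make "generic strategies for finding search boxes" concrete, here is a minimal sketch of what such a heuristic could look like. It is not ScreenSlicer's code or API; the hint words, the scoring weights, and the names `score_input` and `find_search_box` are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical hint words; not taken from ScreenSlicer.
HINTS = ("search", "query", "keyword", "find")


def score_input(attrs):
    """Score one <input> element; higher means more search-box-like."""
    input_type = (attrs.get("type") or "text").lower()
    if input_type not in ("text", "search"):
        return 0  # hidden fields, checkboxes, buttons, etc. are ignored
    score = 1  # any free-text input is a weak candidate
    if input_type == "search":
        score += 3
    for key in ("name", "id", "placeholder", "aria-label", "class"):
        value = (attrs.get(key) or "").lower()
        if value == "q" or any(hint in value for hint in HINTS):
            score += 2
    return score


class SearchBoxFinder(HTMLParser):
    """Remembers the highest-scoring <input> seen while parsing."""

    def __init__(self):
        super().__init__()
        self.best = None
        self.best_score = 0

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        score = score_input(attr_map)
        if score > self.best_score:
            self.best, self.best_score = attr_map, score


def find_search_box(html):
    """Return the attribute dict of the likeliest search input, or None."""
    finder = SearchBoxFinder()
    finder.feed(html)
    return finder.best
```

Calling `find_search_box(page_html)` returns the attributes of the best-scoring input, which a browser driver could then use to locate the element and type a query. A real implementation would also need to look at the surrounding form, visibility, and submit buttons.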

------
CGamesPlay
The first thing I tried it on was
[http://hcpdirectory.cigna.com/web/public/providers](http://hcpdirectory.cigna.com/web/public/providers),
which is a typical instance of a site I might like to scrape, but still one of
the simpler ones. It was not able to extract meaningful information.

~~~
logn
That works in my test locally. The cloud servers are rather overloaded at the
moment and also sometimes Tor is just flaky. Here's the request/response:
[http://pastebin.com/raw.php?i=KncMgbfB](http://pastebin.com/raw.php?i=KncMgbfB)

edit: note that there are a couple of different search forms on that page. If
you run this on your own machines you can use the Form Query API, which lets
you specify a form ID and then set specific values for each form field (e.g.,
for a provider search with City/State and Name). And you can use any non-Tor
proxies or a direct connection.

------
aurelius
What an offensive, restrictive, and anti-free software license! Too bad - this
might have otherwise been a useful piece of software. The current license
renders it toxic.

~~~
sciurus
To those curious, the license is the GNU Affero General Public License version
3.

~~~
rkowalick
It is a very open-source-unfriendly modification of the AGPLv3 license.

~~~
logn
The modifications (available to paying licensees) are designed to let this
project be useful to those running scraping SaaS businesses. It lets them hook
this into their proprietary infrastructure without GPL'ing their whole stack.
Revisions/additions to the project itself need to be licensed as AGPL but the
rest can remain proprietary.

The AGPLv3 license stands on its own without the modifications (they're
optional). It's essentially a dual licensing option.

There are a limited number of ways to do open source and make a living at it.
And given that I am in business independently with no investors, I don't have
many other options. The one path I've ruled out, at least for now, is BSD-style
licensing, as that just allows SaaS operators to leverage my work, deny
users freedom, and not pay me for my time spent helping their commercial
projects.

~~~
sireat
Ahh, the Peanut Butter Hula Hoops crazy licence:
[http://www.billthelizard.com/2012/05/which-open-source-licen...](http://www.billthelizard.com/2012/05/which-open-source-license.html)

The more I think about it the more I like this dual licencing setup which
seems a realistic way to ensure those working on OS get paid.

It lets everyone experimenting enjoy the code, whether they are a destitute
student or an aspiring startup.

Meanwhile, if you are someone who wants to actually make money from this AND
wants to hide your own code, you've got to pay the piper.

Otherwise you end up in BSD land (or is it MIT land) where everyone takes from
your project and gives nothing back.

------
ivan_ah
Is there part of this that I could use for "main text content" detection on
generic web pages? (i.e. not the search results listings)

~~~
logn
I recommend Boilerpipe, which happens to be embedded in ScreenSlicer too:
[https://code.google.com/p/boilerpipe/](https://code.google.com/p/boilerpipe/)
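
For a feel of how "main text" detection can work, below is a rough sketch based on two shallow features per text block: word count and link density. It is a stand-in for illustration only, not Boilerpipe's actual implementation or API; the thresholds, the tag sets, and the names `BlockCollector` and `main_text` are assumptions.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch only.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts linked characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished (text, linked_chars) pairs
        self._text = []
        self._linked = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block and start a new one."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._linked))
        self._text, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style contents
        self._text.append(data)
        if self._in_link:
            self._linked += len(data.strip())


def main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = []
    for text, linked in collector.blocks:
        if len(text.split()) >= min_words and linked / len(text) <= max_link_density:
            kept.append(text)
    return "\n".join(kept)
```

Short, link-heavy blocks such as navigation bars and footers fall below the thresholds, while paragraphs of running prose pass. For real pages the library itself will do far better than this sketch.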

------
siegecraft
You're supposed to build an audience first before you lock them in, not do it
right at the beginning.

------
Rodeoclash
Looks like it's getting crushed at the moment.

