
Show HN: Web Scraping Language (WSL) - scrape_it
https://scrape.it
======
shubhamjain
This is something that I wanted—a rich language for scraping that can help me
avoid writing a lot of boilerplate. However, I am not too sure about the
P2P/decentralized nature. For my projects, that would be a little too
complicated. Would I be able to use just the WSL part?

~~~
scrape_it
yeah I've actually had a sudden change of heart - implementing a
p2p/decentralized theme is quite a bit more work than I anticipated, mainly
syncing problems. originally planned on releasing 3 weeks ago....this p2p is
taking up almost all of my time and I'm not sure if ppl even give a shit about
it....so much for my ideological mania

I'm going to put that p2p stuff on hold and focus on getting a REPL terminal
app that you can start running WSL on this week.

~~~
russdpale
decentralized is very important, don't give up on it! It's fine if people don't
care about it. People don't care about relational algebra, but SQL is used
everywhere in their lives and they neither know nor care.

The idea of distributed, criss-crossed data sets all over the internet is, I
think, a powerful one at a time when data is wielded as power and access to it
is so asymmetrical.

~~~
scrape_it
yeah my original idea was a sort of a Google Freebase _P2P Strikes Back Like
KaZaA_

Imagine a gigantic structured data tree where relational, tabular data of any
kind is available.

You can see why I was seduced by this idea of an all-out liberalization of
information by making the storage and computation decentralized and
p2p-reliant, and a selective one at that: it allows you to create your own
private "pool" of workers, each with a _rotating IP address_ of your choosing.

Basically any data that is uploaded to a scrape:// URL which is seeded by
other peers running the Scrape.it client (people you shared the URL with) will
_stay up there_, theoretically for as long as there are "seeders", sorta like
BitTorrent.

You could share a scrape:// URL and maybe somebody will knock down your doors
(CIA_OPEN_UP), but if it's shared globally, and if one theoretically also uses
other traditional anonymity-inclined tools of one's choosing that possibly
rhyme with XOR, then _anything is possible_ ¯\\_(ツ)_/¯

Large amounts of data can be crawled, because scaling up the volume and speed
is literally just authorizing another peer to have write access to your local
Data Sanctuary (by default only read access is granted). Scrape.it has powered
right through even the most stubborn 2009 non-vanilla
AJAXY-ANGULARLY-JQUERY-SPAGHETTI web apps where the backward navigation is
broken, dramatically lowering the cost barrier to the data available online.

Ex) it can automate a form by trying every combination of inputs, for instance
every option in a select drop-down: search this list of product IDs and crawl
everything on this J2EE enterprise-esque web app from 10 years ago.
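
As a rough illustration of the "every combination" idea above (this is not Scrape.it's actual code, just a minimal Python sketch), the set of form submissions can be enumerated as a Cartesian product of each field's options. The field names and values below are hypothetical:

```python
from itertools import product


def form_combinations(fields):
    """Yield one dict per combination of option values across all fields."""
    names = list(fields)
    for values in product(*(fields[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical form: two select drop-downs plus a list of product ids.
fields = {
    "category": ["books", "music"],
    "sort": ["asc", "desc"],
    "product_id": ["1001", "1002", "1003"],
}

submissions = list(form_combinations(fields))
print(len(submissions))  # 2 * 2 * 3 = 12
print(submissions[0])
```

Each dict would then be one form submission to fill in and crawl. The number of combinations grows multiplicatively with every field added, which is presumably why spreading the work across a pool of peers matters.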

The most recent discussion on how the web needs an open index so that others
may build on it,
[https://news.ycombinator.com/item?id=19713604](https://news.ycombinator.com/item?id=19713604),
is still the ideal standard.

Creating a completely free and decentralized bank for structured data that is
impossible to completely take down once shared beyond your group, with full
end-to-end encryption in between and a cryptographically verifiable, ordered
list of edits....
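
One common way to get a verifiable, ordered list of edits is a hash chain, where every entry commits to the hash of the one before it. The thread does not say how Scrape.it does this, so the following is only a minimal Python sketch of that general technique. The entry layout, the `GENESIS` placeholder and the function names are all assumptions:

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash, edit):
    """Hash an edit together with the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, "edit": edit}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_edit(log, edit):
    """Append an edit, chaining it to the current head of the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "edit": edit,
                "hash": entry_hash(prev_hash, edit)})


def verify(log):
    """Recompute every hash in order; any changed or reordered entry fails."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["edit"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_edit(log, {"op": "set", "path": "news/0/title", "value": "Show HN"})
append_edit(log, {"op": "set", "path": "news/1/title", "value": "Ask HN"})
print(verify(log))  # True

log[0]["edit"]["value"] = "tampered"
print(verify(log))  # False
```

A chain like this only makes tampering detectable by peers who already know the latest hash. Proving who made an edit would additionally need signatures, which this sketch leaves out.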

but alas, I'm really pressured to get this out the door so looks like I will
just have to focus on the bare minimum in terms of client for now.

I originally aimed this to be a standalone self-hosted desktop/server tool...

 _anyways I digress, I need to get back to work. Lately it's honestly been a
challenge mentally and emotionally(?) from some other life bullshit, it's only
when I get into the zone that I feel free_

------
slig
Hi, I see that you used our project's [1] layout as a template for your
project, but forgot to change the image and the link on it. Just pointing it
out. Good luck on your project.

[1]: [https://github.com/javascript-obfuscator/javascript-obfuscat...](https://github.com/javascript-obfuscator/javascript-obfuscator-ui)

~~~
scrape_it
oh yeah forgot about that....it's a great template, everything all in one
html...many thanks. I will swap out the drawing.

------
Fudgel
What does it use to scrape? Is it just an xml/html parser, or does it run a
headless browser?

~~~
scrape_it
It uses a sort of a DSL that I came up with, and removes a lot of boilerplate.

Headless yes.

You write WSL (Web Scraping Language) to scrape 2 pages of HN:

    GOTO news.ycombinator.com | EXTRACT {.morelink, 2} {news: .storylink}

_(you may certainly remove that ' , 2 ' business based on the moral system you
subscribe to)_
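
For comparison, here is roughly what that one-liner would replace if written by hand. This is a minimal Python sketch under one reading of the WSL above: `{.morelink, 2}` is taken to mean "follow the `.morelink` link for at most 2 pages" and `{news: .storylink}` to mean "collect the text of `.storylink` anchors under the key `news`". It parses static HTML with the standard library instead of driving a headless browser, and the `scrape` helper, its parameters and the canned `PAGES` are hypothetical stand-ins for real requests:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the class names, href and text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self._current = {
                "classes": (attrs.get("class") or "").split(),
                "href": attrs.get("href"),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def scrape(fetch, start_url, item_class, next_class, max_pages):
    """Collect item link texts, following the 'next' link up to max_pages."""
    results, url = [], start_url
    for _ in range(max_pages):
        parser = LinkCollector()
        parser.feed(fetch(url))
        parser.close()
        results += [a["text"].strip() for a in parser.links
                    if item_class in a["classes"]]
        next_links = [a["href"] for a in parser.links
                      if next_class in a["classes"] and a["href"]]
        if not next_links:
            break
        url = urljoin(url, next_links[0])
    return {"news": results}


# Canned pages stand in for the network so the sketch runs offline.
PAGES = {
    "https://news.ycombinator.com/":
        '<a class="storylink" href="https://a.example">Story A</a>'
        '<a class="morelink" href="news?p=2">More</a>',
    "https://news.ycombinator.com/news?p=2":
        '<a class="storylink" href="https://b.example">Story B</a>'
        '<a class="morelink" href="news?p=3">More</a>',
}

data = scrape(PAGES.__getitem__, "https://news.ycombinator.com/",
              "storylink", "morelink", 2)
print(data)  # {'news': ['Story A', 'Story B']}
```

Swapping `PAGES.__getitem__` for a function that actually downloads a URL would point the same loop at a live site.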

------
catchmeifyoucan
Download links are broken

~~~
scrape_it
I'm working on it as I speak.

