

Parsley: a simple language for extracting structured data from web pages - pmoriarty
https://github.com/fizx/parsley/wiki

======
fizx
Hi everyone. Library author here. I worked on this a bunch back in maybe
2009-2010. It's inspired by the work I did with tectonic on selectorgadget.

So my current thinking on the idea is reflected in
[https://github.com/fizx/pquery](https://github.com/fizx/pquery). PQuery
addresses some weaknesses of Parsley by embedding the ideas in JavaScript.

(1) Parsley isn't Turing-complete, and many web pages are ugly, so you often
have to resort to pre/post-processing in some scripting language. I was never
able to get sufficient power out of a purely declarative language.

(2) JavaScript environments are readily available (even in embedded form), and
are more accessible than C.

(3) If your crawler already executes JavaScript to render dynamic pages, then
running more JavaScript in that environment is pretty easy.

I guess I'm a little late to the thread (yay weekends) but I'll answer any
questions people may have.

~~~
tectonic
Those were fun times :)

Ruby bindings:
[https://github.com/fizx/parsley-ruby](https://github.com/fizx/parsley-ruby)

Python bindings:
[https://github.com/fizx/pyparsley](https://github.com/fizx/pyparsley)

------
tbatchelli
Not to be confused with Christophe Grand's Clojure parser library:

[https://github.com/cgrand/parsley](https://github.com/cgrand/parsley)

~~~
nvader
(also replying to siblings)

Parsley is just such a good name. It contains the word "parse" as a kangaroo,
and it evokes the image of something fresh, green, and edible.

I just ran

    grep /usr/share/dict/words -e "^pars[^']*[^s]$"

to find the following list of words beginning with pars

    parse
    parsec
    parsed
    parser
    parsimony
    parsing
    parsley
    parsnip
    parson
    parsonage

I think I might call my next parsing-oriented tool either parson or parsnip.
:)

~~~
bshimmin
I was interested in "kangaroo"; Wikipedia suggests a kangaroo word should, in
addition to having the same letters and in the same order as its parent, also
have the same meaning (eg. masculine -> male), so "parse" isn't quite a
kangaroo for "parsley", apparently. It'd be a moderately fun little challenge
to write a little program to find some kangaroos - though Wiktionary
(predictably) has a nice list already here:
[http://en.wiktionary.org/wiki/Appendix:Kangaroo_words](http://en.wiktionary.org/wiki/Appendix:Kangaroo_words)

"Parsley" is indeed a great name. I'll contribute another Parsley - a Flex
framework of yore also bore that name.
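
A minimal sketch of that "little program", assuming the only thing we check is the letter-order condition (the candidate's letters appear in the parent, in the same order). The same-meaning condition would need a thesaurus and is not attempted here. The function names and the tiny word set are made up for illustration; in practice you would load something like /usr/share/dict/words.

```python
def is_subsequence(small, big):
    """True if the letters of `small` appear in `big` in the same order."""
    remaining = iter(big)
    # `ch in remaining` consumes the iterator up to the first match,
    # so each later letter must be found after the previous one.
    return all(ch in remaining for ch in small)


def joey_candidates(parent, words):
    """Words (other than `parent`) whose letters occur in order inside `parent`.

    This checks spelling only; whether a candidate also shares the
    parent's meaning is left to the reader.
    """
    return sorted(w for w in words if w != parent and is_subsequence(w, parent))


# Hypothetical sample word set, just for the demonstration.
WORDS = {"male", "masculine", "parse", "parsley", "parsnip", "par", "sly"}

print(joey_candidates("masculine", WORDS))  # ['male']
print(joey_candidates("parsley", WORDS))    # ['par', 'parse', 'sly']
```

By the spelling test alone, "parse" does sit inside "parsley"; it is the meaning requirement that rules it out.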

~~~
nvader
The reason I thought it was a kangaroo was specifically because it was being
used as a name for a parsing library.

After a library is named "Parsley", the list of meanings for Parsley now
includes "A parsing library", and so I see it as a kangaroo for Parse.

------
xchaotic
Why not start with XPath and CSS selectors and pre-/post-process in JS as
needed?
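
For illustration, here is a rough sketch of that split: a small declarative map of field names to selectors, plus ordinary code for the post-processing. It is written in Python rather than JS, uses only the limited XPath subset in the standard library's `xml.etree.ElementTree`, and is not Parsley's or PQuery's syntax. The sample page, the field names, and the `SPEC`/`extract` names are all invented, and the page is assumed to be well-formed XML, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page. Real-world HTML usually needs a
# lenient parser before ElementTree can read it.
PAGE = """
<html>
  <body>
    <h1> Example thread </h1>
    <ul>
      <li class="comment"><span class="user">alice</span><span class="points">12 points</span></li>
      <li class="comment"><span class="user">bob</span><span class="points">3 points</span></li>
    </ul>
  </body>
</html>
"""

# Declarative part: field name -> path, relative to each comment element.
SPEC = {
    "user": "./span[@class='user']",
    "points": "./span[@class='points']",
}


def extract(root):
    """Apply SPEC to every comment, then clean up values in plain code."""
    title = root.findtext(".//h1", default="").strip()
    comments = []
    for li in root.findall(".//li[@class='comment']"):
        row = {field: li.findtext(path, default="") for field, path in SPEC.items()}
        # Post-processing step that a purely declarative spec cannot express:
        # turn "12 points" into the integer 12.
        row["points"] = int(row["points"].split()[0])
        comments.append(row)
    return {"title": title, "comments": comments}


result = extract(ET.fromstring(PAGE))
print(result)
```

The selectors stay declarative and easy to read, while anything messy (trimming, type conversion, conditionals) lives in the host language.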

------
divideby0
Looks pretty awesome, esp the clean DSL for your page model, but it seems like
most of the documentation might be missing. How sophisticated is the crawler
portion? Does it support Nutch-style generators that crawl more frequently
updated pages more frequently? Or is it more designed for focused, one-off
crawls a la Scrapy?

~~~
fizx
The crawler portion is about as sophisticated as `wget -r`.

------
keyle
Shameless plug...
[https://github.com/keyle/json-anything](https://github.com/keyle/json-anything)
did this a while back

------
zo1
Looks pretty neat, +1 for concept & implementation. When I get time I'll be
trying it out.

I'd also like to give some sort of -1 for the recycled library name, though
it's not a technical nit pick, just a personal one. The name of the library is
mostly dominating the discussion here at the moment, and that's a shame.

------
grogenaut
Interesting straight duh idea (as in duh why didn't I think of that). Would
use it. Wonder how it handles looping.

Note that 3/4 of the links on the main page point to wiki pages that don't exist yet.
Looking forward to it, or just writing it for myself in go :)

------
keyle
Btw I had this idea of a web query language, but it never went anywhere:
[https://gist.github.com/keyle/10951106](https://gist.github.com/keyle/10951106)
Have a look and let me know your thoughts if any.

~~~
dominotw
Looks a lot like using a LINQ to XML provider to query an HTML page.

~~~
SigmundA
Using
[https://github.com/MindTouch/SGMLReader](https://github.com/MindTouch/SGMLReader)
you can use LINQ to XML or anything else that accepts an XmlReader interface
in .NET.

------
mkoryak
Last commit was in 2013. Is it done, or is it just no longer maintained?

~~~
fizx
I don't know. Perhaps a little of each. See my other comments here.

