
Ferret – Declarative web scraping - ziflex
https://github.com/MontFerret/ferret
======
miki123211
I think we are confusing two things here.

The OP probably meant this tool as something useful in his situation and
shared it with others, in case someone else needs something like that. He was
probably thinking along the lines of "If someone needs that, it's fine, let
them discover something like that exists and use it". People over here are
viewing this as a one-size-fits-all solution for web scraping and pointing out
valid reasons why it's a bad idea to use it like that. I think we should
accept that this tool might be good for some but completely unnecessary for
others, and we shouldn't criticize it for not being useful for our purposes.

One situation where this tool has clear advantages over other solutions is
client-side scraping. If we made an app for iOS/Android/Windows/whatever that
runs on devices owned by end users and crawls data upon request, perhaps from
multiple websites, having the crawlers written as external scripts would be
extremely useful. That would allow you to push updates to the crawlers
separately, immediately after a website changes its layout, without the need
to update your app. Making a gallery of downloadable crawlers for more sources
would probably also be possible. The limitations of the language are very
advantageous in this situation, as crawlers are mostly sandboxed and can't
destroy your filesystem, steal your data, etc. This tool also allows keeping
the crawlers separate from the app. That would allow people to create a global
npm-like repository of crawlers that works regardless of what programming
language you use (provided someone wrote an implementation of this tool in
your language). Imagine a use case like building a book price comparison app
in, let's say, Java for Android and Swift for iOS, and maybe even C++ because
some users still run Windows XP and you'd like the app to be available to
them, and being able to download the crawlers for Amazon and tens of local
bookselling websites that would work in all of those apps, without the need to
write them yourself. If used right, this tool could actually allow programmers
to pretend there's a semantic web as originally imagined and write services
that interact with various websites in surprising ways, without thinking about
how the interactions are done at a lower level.
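
To make this concrete, such a downloadable crawler could be little more than a
short Ferret script. A minimal sketch, assuming a hypothetical bookselling
site and made-up CSS selectors, built only from the DOCUMENT/ELEMENTS/ELEMENT
functions shown in the README example:

    
    
      LET page = DOCUMENT("https://books.example.com/search?q=dune", true)
    
      FOR book IN ELEMENTS(page, '.result')
        RETURN {
          title: ELEMENT(book, '.title'),
          price: ELEMENT(book, '.price')
        }
    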

~~~
ziflex
Thank you very much for your valuable feedback and I'm glad that someone has
finally got the idea :)

------
hacker_9
I'm sorry but there is nothing new here? This seems like a backwards step if
anything.

Usually when web scraping, I can just load in the HtmlAgilityPack (C#), point
it at a URL, then write some functional code to extract the necessary data.

Even better, I'll examine the website in Fiddler and hope they have a data-
view separation going on, and be able to just intercept the JSON file they
load instead.

Worst-case scenario, I need to dynamically click on buttons etc., but this can
usually be handled by Selenium, or if they detect that, just roll a custom
implementation using CefSharp (again not hard, just download the NuGet
package, and it lets you run your own custom JavaScript).

A new, more limited language (with no IDE tooling) is not the way to go. If
anything, a better web scraper would just make the processes I mentioned above
more seamless, for example by combining the finding/selecting of elements in
Chrome with codegen.

~~~
ziflex
The main advantage of this over your approach with HtmlAgilityPack is that
Ferret can handle dynamic web pages - those that are rendered with JS. It can
also emulate user interactions. But anyway, thanks for your feedback :)
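
For instance, here is a condensed version of the README query, driving a
JS-rendered page and emulating input (the selectors are Google's and may of
course change):

    
    
      LET page = DOCUMENT("https://www.google.com/", true)
    
      INPUT(page, 'input[name="q"]', "ferret")
      CLICK(page, 'input[name="btnK"]')
      WAIT_NAVIGATION(page)
    
      RETURN ELEMENTS(page, '.g')
    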

~~~
nthuser
I think AngleSharp can handle JS and it's not that different from
HtmlAgilityPack.
[https://github.com/AngleSharp/AngleSharp](https://github.com/AngleSharp/AngleSharp)

~~~
voltagex_
14th of July this year:
[https://github.com/AngleSharp/AngleSharp/issues/693](https://github.com/AngleSharp/AngleSharp/issues/693).
I'm not sure there's much JS support.

------
grillorafael
I think I have never seen so many negative comments before. Yes, the OP's idea
might be a bit unnecessary, but it surely looks interesting. It could mature
into a very interesting project. And depending on your business, if you rely
heavily on crawling, having a specific language for it might help with making
code more uniform.

------
tlrobinson
What makes this language "declarative"? It looks pretty imperative to me.

~~~
bryanrasmussen
I'm not actually seeing much code controlling the flow of the program.

~~~
tlrobinson
I mean, I can remove control flow constructs from an imperative language and
call it declarative, but that wouldn't be very useful.

What is the advantage of it being declarative? At least for the example, the
equivalent imperative code is about the same length.

~~~
bryanrasmussen
I don't think there is much advantage in being declarative in this case,
because whenever I need to scrape stuff I have to do a lot of edge-case
handling and I want to be in control. But it does seem to be a recurring dream
to make some sort of web query language (based on my memories of old Dr.
Dobb's issues).

------
buster
To be honest, I am not sure why this needs yet another query language.. :(

~~~
ziflex
Actually, it's not yet another language :) The syntax is taken from ArangoDB's
AQL:
[https://docs.arangodb.com/3.3/Manual/](https://docs.arangodb.com/3.3/Manual/)

------
anonytrary
I was expecting an ML-driven framework where you write the HTML you want to
scrape, and the framework diffs the trees and attempts to extract the
information from the target tree as best it can to match your input tree.
That's what pops into mind when I think of "declarative" scraping.

    
    
      LET google = DOCUMENT("https://www.google.com/", true)
    
      INPUT(google, 'input[name="q"]', "ferret")
      CLICK(google, 'input[name="btnK"]')
      WAIT_NAVIGATION(google)
      LET result = (
        FOR result IN ELEMENTS(google, '.g')
          RETURN {
            title: ELEMENT(result, 'h3 > a'),
            description: ELEMENT(result, '.st'),
            url: ELEMENT(result, 'cite')
          }
      )
      RETURN (
        FOR page IN result
        FILTER page.title != NONE
        RETURN page
      )
    

Looks an awful lot like:

    
    
      const { document, input, click, elements, waitNavigation } = require("your-library")
      const scrape = () => {
        let google = document("...", true)
        input(google, "...", "...")
        click(google, "...")
        waitNavigation(google)
        return elements(google, ".g")
          .map(r => {...})
          .filter(p => {..})
      }
      scrape();
    

Am I missing something here? I don't see anything declarative about the first
one over the second; both of these look identical and rather imperative to me.
Is "declarative" becoming a buzzword (thanks to React, maybe?), or am I
missing something?

------
weego
This looks all wrong. Page scraping is not accessing a data source in a way
that makes a query language make sense. The moment you need to interact with
the page and admit there's a DOM under there, it breaks the idiom.

And I don't know why it's remaking variable declaration, and why is the for
loop so verbose? If you insist on a query language, go the whole way and
remove repetition and syntax complexity, because that's the only thing that
could actually add value.

~~~
ziflex
The DOM is a representation of some data, which means you can extract the data
and then manipulate it. The language itself has nothing related to the DOM;
all DOM operations are implemented via functions from the standard library.

"Good artist copy, great artist steal" I'm trying to be a good artist trying
to not invent a new brand language (I'm not that smart), so I just picked up
(copied) an existing one that fits better for dealing with complex structures
like trees. So it is AQL - ArangoDB Query Language.
[https://docs.arangodb.com/3.3/Manual/](https://docs.arangodb.com/3.3/Manual/)
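
To make the split concrete: in the sketch below, the LET/FOR/RETURN core is
plain AQL and knows nothing about pages, while DOCUMENT() and ELEMENTS() are
ordinary standard-library functions (the URL and selector are just
placeholders):

    
    
      LET squares = (
        FOR n IN [1, 2, 3]
          RETURN n * n
      )
    
      LET page = DOCUMENT("https://example.com/", true)
    
      RETURN {
        squares: squares,
        links: ELEMENTS(page, 'a')
      }
    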

If you have any suggestions on how to improve the language, you are very
welcome.

~~~
rabidrat
How about using an existing language, like Python? You can make a really great
DSL using Python, and then people have access to all the other Python language
features they already know, the stdlib they already know, and the third-party
modules they already know...

~~~
ziflex
I could, if I knew Python well :) But I've done it the way I needed it to be
done. I wanted to have an isolated and safe environment that would allow me to
easily scrape the web without dealing with infrastructural code.

~~~
rabidrat
Yup, I get it, I want that too, but I don't want to learn another language
just to do that :/

------
ianbicking
I think this could go further in terms of making it declarative.

A simple declarative approach could be to take this:

    
    
        LET google = DOCUMENT("https://www.google.com/", true)
    

and, instead of thinking about it as an action (get this page), think about it
as giving you an object. The result is a tuple of the URL, the time fetched,
and maybe other information (like the User-Agent). This helps with exploratory
scraping, where you want to be able to repeat actions without always re-
fetching the documents. And you'll be constructing a program, unlike in a
REPL, where you always write the program top to bottom, including all your
intermediate bugs.

Changing DOCUMENT() is easy enough. Things like CLICK() are a bit harder,
though if you extend the data structures you can have a document that is the
result of clicking a certain element in a certain previous document. Again, to
do it the first time you have to actually DO the action, but later on perhaps
not. And you'll be constructing interstitial objects that are great for
debugging.

Then what could make it feel really declarative is having more than one
presentation of an execution. You can package up a scraping run, and then you
can answer questions about WHY you ended up with certain results.

~~~
ziflex
That's what you can do right now :)

[https://github.com/MontFerret/ferret/blob/master/docs/exampl...](https://github.com/MontFerret/ferret/blob/master/docs/examples/input.fql)

The document returned from the DOCUMENT() function represents an open browser
tab, which allows you to do all interactions with the page.

~~~
ianbicking
Well, that's what I'm saying... right now, making it represent an open browser
tab with a specific state, where everything DOES something, isn't declarative.
But it could be declarative if you changed how those commands are implemented.

Or, to phrase it another way: if the program represents a PLAN, then it's
declarative. If it represents a series of things to DO, then it's imperative.
It seems like it's doing things, but it could plan things with the same
syntax.

~~~
ziflex
Oh yes. The reason for this is that, for now, the language itself is DOM
agnostic; it's just a port of an existing one
([https://docs.arangodb.com/3.4/AQL/](https://docs.arangodb.com/3.4/AQL/)).
So the entire DOM side is implemented by the standard library, which is
pluggable. In the future, I might extend the language to make it less DOM
agnostic by introducing new keywords for dealing with that. But for now you
have to move the document object around, which is not that bad, because you
may open as many pages as you want in a single query.
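
For example, a sketch with placeholder URLs and selectors, pulling from two
pages in one query:

    
    
      LET first = DOCUMENT("https://site-one.example/", true)
      LET second = DOCUMENT("https://site-two.example/", true)
    
      RETURN {
        first: ELEMENTS(first, 'a.title'),
        second: ELEMENTS(second, 'a.title')
      }
    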

------
hinkley
In late 2000, at the tail end of the bubble bursting, there was a search
engine company trying to build on a platform of push notifications instead of
crawling.

I know that at my current company the plurality of traffic comes from
crawlers. We don’t want to throttle them, because that’s biting the hand that
feeds, but it sure sucks.

I often wonder how many crawlers you need before it’s cheaper for a website to
volunteer up when pages change or new ones arrive.

~~~
th0ma5
Maybe PointCast?
[https://en.wikipedia.org/wiki/PointCast_(dotcom)](https://en.wikipedia.org/wiki/PointCast_\(dotcom\))

------
janci
Parsing the HTML or traversing the DOM is the easy part. Doing request queues,
IP rotation, data quality management, exponential backoff, etc. at scale is
much harder.

~~~
ziflex
PRs are welcome :) There is going to be a separate project within the
organization that will do all these things and even more. It's just the
beginning :)

------
AlphaWeaver
Does anyone else find the narrative style in the README (memes, Internet
language, etc.) obnoxious?

------
andrewstuart2
If you're looking for a slightly more native way to declaratively scrape (the
data-binding aspect, at least) in Go based on CSS selectors, I wrote a wrapper
around GoQuery that just uses struct tags for mapping. You don't have to learn
a new language, and it should feel familiar if you've ever written CSS or
jQuery. I've found it helps reduce a lot of boilerplate and makes things a lot
easier to come back and read than what was previously a lot of nested GoQuery
functions, selectors, etc.

Might be helpful and in a similar vein. :-)

[https://github.com/andrewstuart/goq](https://github.com/andrewstuart/goq)

------
nerdponx
Web scraping "at scale" ends up being a lot more complicated than blinding
firing HTTP requests. Scrapy, for example, supports arbitrary "middleware"
that can, for example, follow 301 Redirect, respect robots.txt files, follow
sitemap.xml files, etc.

To what extent is this supported (or, to what extent do you plan to support
it)? Similarly, since the front-end language is essentially a compiler front
end, would it be possible to write an alternative "backend" (e.g. something
that distributes requests across a cluster)?

~~~
ziflex
This package is more like a runtime. There are plans to create a dedicated
server where you would be able to store your queries, schedule them, and set
up output streams to something like Spark or Flink. For now, it does not
respect robots.txt, but that can be easily added.

Out of the box, there is no scaling mechanism yet, since the project is a WIP.
But it's written in Go, which makes it pretty fast.

One idea for how you could scale it is to run a cluster of headless Chrome
instances, put a proxy/load balancer in front of it, and give Ferret a URL to
the cluster; it will treat it as a single instance of Chrome. The only problem
is that you would need to differentiate requests from the CDP (Chrome DevTools
Protocol) client and, once a page is open, redirect all related requests to
the same Chrome instance.

------
graphememes
This is imperative scraping, and not as easy as just working with XML (which
is what the page is anyway).

~~~
ziflex
One of the goals of the project is to hide the technical details and
complexities that come with modern web scraping, especially when you deal with
dynamic web pages.

~~~
graphememes
Which is already very easy with Puppeteer.

I feel that in this case it's solely to stay within the Go ecosystem, which
seems to be counter to what matters: business value.

~~~
ziflex
Yes, Puppeteer and Ferret use the same technology under the hood - the Chrome
DevTools Protocol.

But the purpose of the project is not to "be better than". No, the purpose is
to let developers focus on the data itself, without dealing with the technical
details of the underlying technologies. You can treat it as a higher
abstraction on top of Puppeteer / CDP.

It does not really matter whether it's written in JS or Go. The main goal is
to simplify data scraping and make it more declarative (even though some
people say it's not a declarative language :))

------
taf2
I think the name could be a problem? Or is this related to Ferret, the C
library with Ruby bindings?
[https://github.com/dbalmain/ferret](https://github.com/dbalmain/ferret)

~~~
ziflex
Nope, just thought that ferrets are cool :)

------
teabee89
This reminds me of Russ Cox's toy Webscript language:
[https://research.swtch.com/webscript](https://research.swtch.com/webscript)

------
browsercoin
yetanotherscrapingframework

~~~
meritt
What's interesting to me is not that people continue to feel the need to build
scraping frameworks (it's an excellent beginner project because it encompasses
so much of web development), but why HackerNews finds scraping frameworks so
interesting. There seems to be a scraping framework at the top of HN every
week or two.

~~~
existencebox
Personal theory, as someone who has done a good bit of scraping over the last
decade:

Extremely common problem space, with a lot of tantalizing opportunities for a
"platform" or "shared language" where one doesn't exist. I see existing
tooling like BeautifulSoup, Scrapy, and Selenium as isomorphic to the "near
hardware level" of our problem-space tool stack, whereas the abstractions and
higher-level logic are often defined per use case.

On top of (or perhaps because of) that, one often writes a lot of boilerplate,
but when it comes time to genericize, one often ends up writing the tool that
genericizes only within one's own problem space; for all the tantalizing
opportunities, the number of "not quite fully intersecting scraping problem
spaces" (and associated tradeoffs/different paradigms) is far more massive
than I considered when I started any of my own scraping tools.

This has led me to take a very opinionated view with my own tooling, wherein I
build for _ONE SPECIFIC RECURRING SCRAPING PRIMITIVE_ (in my case, treating
the whole world as a stream of events. You want something that can be more or
less massaged into that? Cool; maybe something for you. If not? You probably
want to look at a different set of tools).

------
the_other_guy
I think the comments here are unfairly harsh. I really like this innovative
idea of having a dedicated language. If it can see client-side rendered HTML
(e.g. React, Vue, etc.), that would be a whole other level for me, since I
don't think this has been done before.

~~~
ziflex
It can! :)

Even more - it can interact with these pages! Here is an example that uses the
Google Search page:
[https://github.com/MontFerret/ferret/blob/master/docs/exampl...](https://github.com/MontFerret/ferret/blob/master/docs/examples/input.fql)

~~~
the_other_guy
Wow, thank you. Now I am absolutely in love with your tool!

~~~
ziflex
Great! ^_^

------
rosha
I am currently using the
[https://www.npmjs.com/package/proxycrawl](https://www.npmjs.com/package/proxycrawl)
web scraping system, which offers anonymous scraping too, but they do not have
a Go library yet. Does this system also support anonymous scraping, like over
a proxy? I'd love to use it if I can scrape URLs with the proxycrawl API, like
in this example:
[https://api.proxycrawl.com/?token=DsFuFiigAZ2Wm6U1BPh7Zw&for...](https://api.proxycrawl.com/?token=DsFuFiigAZ2Wm6U1BPh7Zw&format=html&url=https://www.amazon.com)

~~~
megous
Token?

~~~
rosha
I do not use the TCP token, so I could share it for free; I use the API with
the JavaScript token, which gives me more dynamic content.

------
holoduke
For many years now, the best way to scrape has been to use a headless browser,
for example PhantomJS with Node.js. That, in combination with Tor or a large
proxy pool, is unbeatable by all the alternatives.

~~~
ziflex
This is how it works under the hood. But everything is wired up for you ;)

------
sephware
Why is this a language instead of a library on top of an existing language?

Here's what it would look like as a JavaScript (Node.js or browser) library:

    
    
        let g = getDocument("https://www.google.com/", true);
        
        g.input('input[name="q"]', "ferret");
        g.click('input[name="btnK"]');
        
        g.waitNavigation();
        
        let result = g.elements('.g').map(result => ({
          title: result.element('h3 > a'),
          description: result.element('.st'),
          url: result.element('cite')
        }));
        
        return result.filter(i => i.title !== null);

~~~
RussianCow
I'm not sure why you're being downvoted. As far as I can tell from the
examples, there is nothing that this language brings to the table that
couldn't be implemented instead as an API on top of an existing language.

~~~
ziflex
That's true. The difference is how much effort is needed to do that using an
API.

What it brings is just a higher abstraction of that API, which lets you easily
get work done.

~~~
RussianCow
Do you have a more involved example where Ferret really shines, as opposed to
a library with a similar API in JS or another common language? I really don't
mean to be negative, but I just don't see how Ferret is any easier to use than
something like Nightmare[0]. That said, I'm wondering if it's an issue of
communication more than anything, so maybe a different example than the one in
the readme would help.

[0]:
[https://github.com/segmentio/nightmare](https://github.com/segmentio/nightmare)

~~~
ziflex
You are fine, I totally understand your scepticism. And you are right, there
are definitely issues in communication.

First of all, I built it for myself. I needed a high-level representation of
scraping logic that would run in an isolated and safe environment. Second, I
needed to be able to easily scrape dynamic pages.

So, what I got is:

- A high-level, declarative-ish language that hides all the infrastructural
details, which helps you focus on the logic itself. It helps you describe what
you want without worrying about the underlying technology. Today I'm using
headless Chrome; tomorrow I might use something else, but the change should
not affect your code.

- Full support for dynamic pages. You can get data from dynamically rendered
pages, emulate user actions, etc. Heck, you can even write bots with it.

- Embeddable. For now there is only a CLI, but there are plans to write a web
server where you can save your scripts, schedule them, and set up output
streams.

But the main idea is to provide a high-level, declarative way of scraping the
web. I'm not saying you can't do that with other tools. I'm just trying to
come up with something easier to work with.

Regarding examples: the project is still a WIP, so as I add more complex
features, I will add more complex examples. Here is a more or less complex
one: getting data from Google Search. It's not that difficult, but it
showcases the core feature of working with dynamic pages.

[https://github.com/MontFerret/ferret/blob/master/docs/exampl...](https://github.com/MontFerret/ferret/blob/master/docs/examples/input.fql)

