
Web Scraping with Electron - tazeg95
https://en.jeffprod.com/blog/2019/web-scraping-with-electron/
======
Dunedan
> Is there a better way to surf the web, retrieve the source code of the pages
> and extract data from them?

Yes, of course! To get the source code of a website you don't need a browser
and all its complexity. It makes me so sad how far we have come in terms of
unnecessary complexity for simple tasks.

If you want to extract data from web pages without hauling in hundreds of
megabytes for something like Electron, there are plenty of scraping libraries
out there. In Python alone there are at least two good options: Scrapy[1] and
BeautifulSoup[2].

[1]: [https://scrapy.org/](https://scrapy.org/)

[2]: [https://www.crummy.com/software/BeautifulSoup/](https://www.crummy.com/software/BeautifulSoup/)
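
For a feel of it, here's a minimal sketch with requests + BeautifulSoup
(hypothetical URL and fields, not from the article):

    # Fetch a page and pull data out of it, no browser needed.
    # pip install requests beautifulsoup4
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/").text
    soup = BeautifulSoup(html, "html.parser")

    # Print the page title and every link target.
    print(soup.title.string)
    print([a.get("href") for a in soup.find_all("a")])

A couple of megabytes of dependencies instead of a bundled Chromium.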

~~~
tazeg95
Sure, but I meant building a portable app, for end users who are not coders,
with a GUI, and for a dedicated purpose, like for example browsing Facebook.

So I will edit the question to this: Is there a better way to code a
portable application with a graphical user interface to scrape a given site?

Thanks for your comment.

~~~
rasengan
You can access the HTML of the website and use regular expressions.

~~~
tazeg95
> You can access the HTML of the website and use regular expressions.

Yes, but using regular expressions is the last and least recommended solution;
please read: [https://stackoverflow.com/questions/3577641/how-do-you-parse...](https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php/3577662)

~~~
rasengan
If you read that link, it’s only not recommended because people don’t know
how to use it. Regular expressions are powerful.

~~~
stareatgoats
Read the link. Just wondering how you managed to interpret this:

> regular expressions is a waste of time when the aforementioned libraries
> already exist and do a much better job on this.

as this:

> it’s only not recommended because people don’t know how to use it

~~~
rasengan
> [https://stackoverflow.com/questions/3577641/how-do-you-parse...](https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php/3577662)

It says "can make regex fail when not properly written" etc.

There are circumstances where a premade parsing library makes sense and
others where raw regular expressions do.

The answer is not binary.
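
A contrived Python illustration of both points: a naive regex silently misses
a formatting variation that a parser shrugs off, yet for input whose shape you
control, the regex does the job:

    import re
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    html = '<a href="/a">one</a> <a\n  href="/b" class="x">two</a>'

    # The naive regex misses the second link (newline before href).
    print(re.findall(r'<a href="([^"]+)"', html))  # ['/a']

    # The parser doesn't care how the markup happens to be formatted.
    print([a["href"] for a in
           BeautifulSoup(html, "html.parser").find_all("a")])  # ['/a', '/b']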

------
TicklishTiger
I wish there was an easy way to send commands to the console of a browser.

That would be all I need to satisfy all my browser automation tasks.

Without installing and learning any frameworks.

Say there was a Linux command 'SendToChromium' that would do that for
Chromium. Then to navigate to some page one could simply do this:

    SendToChromium location.href="/somepage.html"

SendToChromium should return the output of the command. So to get the HTML of
the current page, one would simply do:

    SendToChromium document.body.innerHTML > theBody.html

Ideally the browser would listen for this type of command on a local port. So
instead of needing a binary 'SendToChromium' one could simply start Chromium
in listening mode:

    chromium --listen 12345

And then talk to it via HTTP:

    curl '127.0.0.1:12345/execute?command=location.href="/somepage.html"'

~~~
dlkinney
While not currently "easy", there is the Chrome DevTools Protocol.[0] I'm
not aware of a CLI utility that communicates with it, but it wouldn't be
impossible to build one that does what you're looking for. A second tool
could then act as a REST proxy, if calling the commands via curl is really
your jam.

I think you've given my weekend some purpose. Lemme see what I can pull
together...

[0] [https://chromedevtools.github.io/devtools-protocol/](https://chromedevtools.github.io/devtools-protocol/)
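
As a rough sketch of that idea against the raw protocol (assumes Chromium was
started with --remote-debugging-port=9222, plus pip install requests
websocket-client):

    import json
    import requests
    import websocket

    # List the open targets and take the first page's debugger endpoint.
    targets = requests.get("http://127.0.0.1:9222/json").json()
    page = next(t for t in targets if t["type"] == "page")

    # Ask the page to evaluate an expression and read back the value.
    ws = websocket.create_connection(page["webSocketDebuggerUrl"])
    ws.send(json.dumps({
        "id": 1,
        "method": "Runtime.evaluate",
        "params": {"expression": "document.body.innerHTML"},
    }))
    print(json.loads(ws.recv())["result"]["result"]["value"])
    ws.close()

That's essentially the 'chromium --listen' mode the parent is asking for, just
over a websocket instead of plain HTTP.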

~~~
TicklishTiger
Yes, that might work. Maybe an even better approach is to use chromium-chromedriver.

I just got it working like this:

    apt install chromium-chromedriver
    chromedriver

This starts a service that listens on port 9515 for standardized commands to
remote-control a Chromium instance. The protocol is specified by the W3C:

[https://www.w3.org/TR/webdriver/](https://www.w3.org/TR/webdriver/)

I got it to open a browser with this curl command:

    curl -d '{ "desiredCapabilities": { "caps": { "nativeEvents": false, "browserName": "chrome", "version": "", "platform": "ANY" } } }' http://localhost:9515/session

I have not yet figured out how to send JavaScript commands, though.
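
From a skim of the spec, the execute/sync endpoint looks like the missing
piece. An untested Python sketch (W3C-style capabilities; response shapes per
the spec):

    import requests

    base = "http://localhost:9515"

    # Open a browser session.
    session = requests.post(base + "/session", json={
        "capabilities": {"alwaysMatch": {"browserName": "chrome"}},
    }).json()["value"]["sessionId"]

    # Navigate, then run JavaScript in the page.
    requests.post(base + "/session/" + session + "/url",
                  json={"url": "https://example.com/"})
    result = requests.post(base + "/session/" + session + "/execute/sync",
                           json={"script": "return document.body.innerHTML;",
                                 "args": []}).json()
    print(result["value"])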

------
aboutruby
Interesting but seems less powerful than my current setup:

\- I have mitmproxy to capture and manipulate the traffic

\- I have Chrome opened with Selenium/Capybara/chromedriver, routed through
mitmproxy

\- I then browse to the target pages; it records the selected requests and the
selected responses

\- It then replays the requests until they fail (with a delay)

I highly recommend mitmproxy; it's extremely powerful: capture traffic, send
responses without hitting the server, block/hang requests, modify responses,
modify request/response headers.
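
For a flavor of it, a minimal addon sketch (hypothetical host; run with
mitmdump -s addon.py) that tags matching responses and could just as well save
or rewrite them:

    from mitmproxy import http

    def response(flow: http.HTTPFlow) -> None:
        # Only touch traffic for the (hypothetical) target API.
        if "api.example.com" in flow.request.pretty_host:
            # Inspect or rewrite here; e.g. save flow.response.content
            # to disk for later replay.
            flow.response.headers["x-captured"] = "1"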

Higher-level interfaces can then be built on top; Selenium, for instance, lets
you load Chrome extensions and execute JavaScript on any page. You can also
manage many tabs at the same time.

I could write a blog post/demo if people are interested.

~~~
kuhhk
A blog article would be nice. Sounds interesting, but I’m having a hard time
understanding. If it’s replaying requests, how do you get it to do things like
go to the next page and click through all of the paginated results?

~~~
aboutruby
In my case I can't do the pagination automatically, so I have to fetch the
pages myself and then have them replayed.

In most cases you would capture the request and change the "page=" parameter
(either for an HTML page or an API).

You could also use Selenium to click each "next page" link, as sketched below.
This could be parallelized with multiple tabs/windows.
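
The click-through variant is only a few lines with Selenium (hypothetical URL
and link text):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/results?page=1")
    while True:
        # ... scrape driver.page_source for the current page ...
        nxt = driver.find_elements(By.LINK_TEXT, "Next")
        if not nxt:
            break
        nxt[0].click()
    driver.quit()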

The only website that blocks me is Bloomberg because they detect mitmproxy (I
didn't care enough to make mitmproxy harder to detect).

Another detail is that regular Chrome doesn't let you load pages with
untrusted certificates, while chromedriver allows that.

Anyway, I will write about all that; I've already posted some code on my Twitter:
[https://twitter.com/localhostdotdev](https://twitter.com/localhostdotdev)
(which I will turn into a blog).

------
CGamesPlay
I'm going to plug my app that does scraping with Electron:
[https://github.com/CGamesPlay/chronicler](https://github.com/CGamesPlay/chronicler)

To the commenters who don't understand why this is necessary:

\- It reliably loads linked resources in a WYSIWYG fashion, including embedded
media and other things that have to be handled in an ad hoc fashion when using
something like BeautifulSoup.

\- It handles resources loaded through JavaScript, including HTML5 History API
changes.

~~~
aloer
Could you explain what Electron offers here over, for example, a browser
plugin? I'm not that familiar with the limitations of the WebExtension APIs.

It looks like an interesting project, but only for a few selected sites. For
more random browsing I believe I would be too security-conscious
([https://electronjs.org/docs/tutorial/security](https://electronjs.org/docs/tutorial/security))
to allow it.

It might be unlikely that some random script on a random site will target
Electron, but you never know.

~~~
CGamesPlay
Well, in Chrome you can't hook into the network layer to record/replay requests
like I've done here (you could possibly fake it by overriding things with a
ServiceWorker, but that feels brittle and I'm not sure it's even possible).
I'm not familiar with Firefox extensions, but I'm given to understand you
likely could implement this project as one.

Locking down Electron apps to be safe on the larger web is certainly one area
where I think Electron could do a lot better. I think my project has followed
all of the recommendations and should be safe, but I agree with you that it
feels like a bigger attack surface. I personally wanted it to support offline
browsing of documentation sites, which are generally pretty "safe" from that
perspective.

------
SSchick
Are there any other advantages over things like WebDriver or Puppeteer?

~~~
nkozyra
Not really.

I also have no idea what cheerio brings to the table here.

Seems like a hefty solution for web scraping.

