
I only use an iFrame to crawl and scrape content - natzar
http://www.airovic.com/itsjustaniframe.html
======
hugs
This is how the very first version of Selenium worked. The application under
test was in an iframe, and the test controller was in the parent page. The
Selenium "Remote Control" protocol was later added, where the controller would
phone home to a listening web server for commands to relay to the iframe
(basically, AJAX before it had a name). It all mostly worked for the most
common test cases, but we abandoned this approach for reasons similar to those
mentioned in the article -- the edge case limitations became more and more
frustrating over time. Ultimately, we merged with the WebDriver project, which
was implemented in a more native way, avoiding all the limitations of
automation-via-iframe.

~~~
3xblah
"It mostly worked for the most common test cases, but we abandoned this
approach for similar reasons mentioned in the article -- the edge case
limitations became more and more frustrating over time."

The article only mentions malformed URLs and browser run-time errors. Were
there any other "edge case limitations" that became intolerable?

~~~
Fauntleroy
Some websites will detect the iframe and send nothing instead of the requested
page. I happen to know from personal experience that YouTube is one of the
sites that does this.

~~~
chii
> YouTube is one of the sites that does this.

that's because they have a proper iframe-embed url:
[https://www.youtube.com/embed/quyj70RogxI](https://www.youtube.com/embed/quyj70RogxI)
(instead of
[https://www.youtube.com/watch?v=quyj70RogxI](https://www.youtube.com/watch?v=quyj70RogxI)
)

------
fenwick67
Yeesh, everyone is so critical here. It's just a blog post about how somebody
does occasional one-off scraping across multiple pages using browser devtools.

Yes of course injecting an iframe into a third-party site with devtools isn't
going to replace Selenium. But it's a clever little hack in a pinch. No need
to get upset.

~~~
hugs
Go back far enough in time (2004), and injecting an iframe into a third-party
site _is_ Selenium. I agree, using an iframe is a fun hack!

------
natzar
Wow, I just posted this, went to take a nap, got back and 84 points ¿?!

The site is working now, it was retrieving some localhost scripts.

I was just trying to get some feedback and to check if that document was
interesting, because I was spending a lot of time on it.

From what I can see, Selenium does exactly the same thing, but I would choose
this iframe solution (for small to medium projects) anyway.

It's a super small tool that does the job.

Please, let me know if you can fully use airovic.com

------
captn3m0
Won't the same origin policy kick in the moment I try to read the content of
an iframe that isn't on the same origin as my website?

Or is this meant to run in the dev console on the target website? In which
case, the iFrame and the Airovic website don't make sense (the Electron app
mentioned does, sure, but it doesn't exist).

~~~
nodelessness
That was my thinking as well. You cannot access another website's content
opened via iframe via Javascript at all.
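
To make the rule concrete: two URLs share an origin only when scheme, host, and port all match. A minimal sketch using the standard `URL` class; the helper name `sameOrigin` is made up for illustration and is not from the article.

```javascript
// Two URLs are same-origin when scheme, host, and port all match.
// `sameOrigin` is a hypothetical helper name used only for illustration.
function sameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

// A page on the target site and a search page on the same site: true
console.log(sameOrigin(
  "https://newyork.craigslist.org/",
  "https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa"
));

// A third-party page trying to read the target site: false
console.log(sameOrigin("http://www.airovic.com/", "https://newyork.craigslist.org/"));
```

When the check is false, the browser refuses script access to the iframe's document, which is the restriction described above.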

~~~
onlyrealcuzzo
I think you first visit the website, then you inject an iframe onto the page
you're currently on, and then inside that iframe you can scrape any content on
that website.

That's at least what it looked like from his examples.
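
A rough sketch of that flow, under the assumption that it runs in the dev console of the target site so the iframe is same-origin. The names `extractRows` and `scrapePages` and the relative URLs in the usage comment are hypothetical; the selectors are borrowed from the Craigslist example quoted elsewhere in this thread. This is not the article's code.

```javascript
// Pull title and price out of every listing row in a document.
function extractRows(doc) {
  return Array.from(doc.querySelectorAll(".result-row")).map(function (row) {
    var title = row.querySelector(".result-title");
    var price = row.querySelector(".result-price");
    return {
      title: title ? title.textContent.trim() : "",
      price: price ? price.textContent.trim() : ""
    };
  });
}

// Load each URL in one injected iframe and collect rows in the parent page.
// `doc` is the parent document; the result array lives in the parent context,
// so it survives every navigation of the iframe.
function scrapePages(doc, urls) {
  return new Promise(function (resolve) {
    var data = [];
    var next = 0;
    if (urls.length === 0) { resolve(data); return; }
    var iframe = doc.createElement("iframe");
    iframe.onload = function () {
      // contentDocument is only readable when the iframe is same-origin.
      data = data.concat(extractRows(iframe.contentDocument));
      if (next < urls.length) {
        iframe.src = urls[next++]; // navigate the iframe, not the top page
      } else {
        resolve(data);
      }
    };
    iframe.src = urls[next++];
    doc.body.appendChild(iframe); // inserting the iframe starts the first load
  });
}

// Hypothetical usage from the dev console of the target site:
// scrapePages(document, ["/search/apa?s=0", "/search/apa?s=120"]).then(console.log);
```

The sketch does no error handling: if a page refuses to be framed, reading `contentDocument` fails and the promise never resolves.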

------
pcr910303
Okay, the way I see this is that using headless tools like puppeteer or
selenium is tedious; just trying to... er scrape my HN account's favorites
(AFAIK no API) becomes a task when you have to automate login.

Just typing in and pressing the button is much easier than automating the
task, so that's why the iframe is something useful. You can interact with the
content (without code).

~~~
hugs
The irony is that the iframe approach is exactly how the first version of
Selenium worked. It's a cool hack, but we abandoned that approach over time
because automating iframes couldn't cover all automation use cases.

------
asdfman123
This sounds hilariously n00by because it's VB and Internet Explorer, but
creating an Internet Explorer instance through VB in, say, Excel and then
dumping data into Excel was great because I had full control over my IE
instance.

Okay, I'll stop speaking now and revealing the fact I started my career as a
data guy at a giant corporation instead of a software engineer.

~~~
ackbar03
Haha, loser! Looo sseeerr

------
oefrha
I can do everything listed in benefits with puppeteer, while I can’t even make
sense of what iframe is supposed to achieve here, or how it’s even gonna load
(anyone with a shred of sense would set X-Frame-Options to SAMEORIGIN, subject
to exceptions). The airovic.com site doesn’t work and hilariously attempts to
load two seemingly important scripts from localhost...

I’m very confused about this submission, and even more confused about how it
managed to almost top the front page.

Edit: Having read the code samples, it seems the code snippets are supposed to
be run from the same origin in the dev console. A quick and dirty way to
interactively scrape without navigation, I guess? Still not sure what the “all
together: Airovic.com” is supposed to mean, and definitely more limited than
puppeteer.

Edit2: To be fair to the author, they did say

> You cannot bypass their protections without using a HTTP Tunneling
> component.

Which I didn’t see until just now. This is a pretty big caveat though, should
probably be more upfront...
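
For reference, a simplified sketch of how the `X-Frame-Options` values mentioned above decide whether a page may be framed. The function name `framingAllowed` and the header-object shape are assumptions for illustration, and the sketch deliberately ignores the Content-Security-Policy `frame-ancestors` directive, which browsers also enforce.

```javascript
// Rough check of whether a response's X-Frame-Options header would let a
// page at `parentOrigin` frame a page served from `pageOrigin`.
// Assumption: CSP `frame-ancestors` is not considered here.
function framingAllowed(headers, parentOrigin, pageOrigin) {
  var value = "";
  Object.keys(headers).forEach(function (name) {
    // Header names are case-insensitive.
    if (name.toLowerCase() === "x-frame-options") {
      value = String(headers[name]).trim().toUpperCase();
    }
  });
  if (value === "DENY") return false;
  if (value === "SAMEORIGIN") return parentOrigin === pageOrigin;
  return true; // header absent or unrecognized: treated here as no restriction
}

var headers = { "X-Frame-Options": "SAMEORIGIN" };
// Injected from the dev console of the same site: allowed (true)
console.log(framingAllowed(headers, "https://example.com", "https://example.com"));
// Framed from a third-party site: blocked (false)
console.log(framingAllowed(headers, "http://www.airovic.com", "https://example.com"));
```

This matches the observation in the first edit: with `SAMEORIGIN`, the snippets can only work when run from the target site itself.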

~~~
javajosh
He's traversing the site using the injected iframe. That is, there is no
top-level navigation event, only an iframe navigation event. Then he's gathering
information from the iframe hosted DOM and combining it in the context of the
starting, main page.

I think it's kinda clever.

~~~
amelius
Injecting an iframe into websites could trigger an assertion error, because
the iframe isn't supposed to be there.

~~~
natzar
Hi, that's right. Some sites protect against HTML injection. Twitter protects
their site very well, but if you use HTTP tunneling they can't do anything, as
you can modify X- headers.

~~~
stockkid
Interesting, could you explain what you mean by http tunneling and how it can
bypass protection against html injection?

------
elierotenberg
For this kind of task I usually create a private Firefox extension, which
gives me access to extended browser capabilities and the ability to lift some
security-related restrictions. I run it in a sandboxed browser, much like I
would do with something like Selenium or Puppeteer, but I have many more
options to hand-tune the automation.

------
ravenstine
Depending on the nature of the content being scraped, you can use the
`sandbox` attribute to the iFrame to prevent scripts from running.

[https://developer.mozilla.org/en-US/docs/Web/HTML/Element/if...](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe#attr-sandbox)

This was useful for a brief period when I ran a news aggregator that used
iFrames to display content from other news websites. Adding the sandbox
attribute prevented scripts, ads, modals, etc.

For the purpose of scraping, unless you're always on the same domain (or
running a proxy to add CORS), I don't see how an iFrame is better than either
a web extension or a backend script using Puppeteer.
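
A small sketch of the `sandbox` idea, with an illustrative helper (`sandboxedIframe` is not a real API). It keeps `allow-same-origin` and leaves out `allow-scripts`, on the assumption that the parent page is same-origin and still needs to read the framed DOM while the framed page's scripts stay disabled.

```javascript
// Escape a value for safe use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build iframe markup whose sandbox token list omits `allow-scripts`,
// so scripts in the framed page do not run. `allow-same-origin` keeps the
// framed page in its real origin so a same-origin parent can read its DOM.
// `sandboxedIframe` is a hypothetical helper for illustration.
function sandboxedIframe(url) {
  return '<iframe sandbox="allow-same-origin" src="' + escapeAttr(url) + '"></iframe>';
}

console.log(sandboxedIframe("https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa?s=120"));
```

For display-only use such as the news aggregator, a bare `sandbox` attribute with no tokens applies every restriction, but the parent can then no longer read the frame's content.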

------
heyplanet
Is the only reason for the iframe so that it is possible to keep a state in
the top frame while loading different pages?

Because otherwise - since you use the dev tools to inject the iframe - you
don't really need the iframe. You can just run it as a "snippet" in Chromium
or from the multi-line-code-editor in Firefox.

Both have the problem that it all has to be a single file. It would be much
nicer if one could import modules.

~~~
bluntfang
> Both have the problem that it all has to be a single file. It would be much
> nicer if one could import modules.

Isn't this a solved problem in javascript land? Just use a compiler/minifier
and your module oriented js code is in a single file as a build artifact.

~~~
heyplanet
Then every time you want to make a change to your code, you have to go to your
original codebase, make the change, start the compiler, copy the output and
paste it into your dev tools ...

------
3xblah

             // Collect one record per listing row from the iframe's document
             $iframe.contents().find('.result-row').each(function(){
                 data.push({
                     title: $(this).find('.result-title').text(),
                     img: $(this).find('img').attr("src"),
                     price: $(this).find('.result-price:first').text()
                 });
             });
             // And everything starts running when you set first iframe's target url
             $iframe.prop("src", "https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa");
    
    

Looks like he wants output something like

    
    
             title: 
             img:
             price:
    

I tried reproducing this example without using Javascript, instead using curl
and sed. The output is

    
    
             image: 
             title:
             price:
    

I did not try to move "title:" above "image:" though I bet this could be done
using the hold space. Nor did I format this as JSON though that would be easy
to do.

    
    
             n=0;while true;do test $n -le 3000||break;
             curl "https://newyork.craigslist.org/d/apts-housing-for-rent/search/apa?s=$n"|sed -n '
             /result-title hdrlnk/{s/.*\">/title: /;s/<.*//;/^title: /p;};
             /./{/result-meta/,/\/span/{/result-price/s/.*\">/price: /;s/<.*//;/price/p;};};
             /data-ids=\"/{s|1:[^,\">]*|https://images.craigslist.org/&_600x450.jpg|g;s/,/, /g;
             s/1://g;s/>//;s/.*data-ids=/image: /;/^image: /p;}'
             n=$((n+120));done
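
For the JSON step mentioned above, here is one possible sketch that groups the `image:` / `title:` / `price:` lines into records. It assumes a record ends whenever a field name repeats, since the lines carry no explicit separator; the function name `linesToRecords` and the sample listings are made up for illustration.

```javascript
// Group "key: value" lines (in the shape the sed script prints) into records.
// Assumption: a new record starts when a field name repeats.
function linesToRecords(text) {
  var records = [];
  var current = {};
  text.split("\n").forEach(function (line) {
    var match = /^(image|title|price): ?(.*)$/.exec(line.trim());
    if (!match) return; // skip anything that is not a known field
    var key = match[1];
    if (Object.prototype.hasOwnProperty.call(current, key)) {
      records.push(current);
      current = {};
    }
    current[key] = match[2];
  });
  if (Object.keys(current).length > 0) records.push(current);
  return records;
}

// Made-up sample input, not real listings.
var sample = [
  "image: https://images.craigslist.org/abc_600x450.jpg",
  "title: Sunny studio",
  "price: $1500",
  "image: https://images.craigslist.org/def_600x450.jpg",
  "title: Large one bedroom",
  "price: $2100"
].join("\n");

console.log(JSON.stringify(linesToRecords(sample), null, 2));
```

The repeat-key heuristic can merge two listings if one is missing a field that the next one also lacks, so it is only a starting point.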

------
ChrisSD
I've done something similar in Firefox with scratchpad. The main reason is
simply convenience. I don't need to switch to a different workflow, I merely
bring up scratchpad (I often already have a window open with some utility
functions) and can start hacking away immediately.

Sadly scratchpad is going away soon. Fortunately the console now has a
multiline mode, unfortunately it's not as convenient for this use.

------
kseo3l
Why not use something like proxycrawl? Controlling an iframe is slow and
painful.

------
GiantSully
You can even load a browser extension into Chrome with Selenium, or even back
Selenium with an upstream proxy. So why an iframe? What's the edge?

------
ausjke
I do not understand why the iframe is a must here. Why can't I just scrape the
whole page directly? I'm still learning web scraping using scrapy.

~~~
hayksaakian
I think the basic utility is that you keep your parent-frame JavaScript
context.

Normally if you click a link with jQuery, you lose the current context after
the next page loads.

By controlling it inside an iframe, it's more convenient.

------
thenewnewguy
Maybe I'm missing something obvious, but can anyone explain to me how this is
better than using a tool like selenium for scraping? I guess this might be
easier to quickly setup and play around with for one-off scraping?

------
gmac
I describe something very similar here:
[https://github.com/jawj/web-scraping-for-researchers](https://github.com/jawj/web-scraping-for-researchers)

------
iamleppert
Working for a big tech company, stuff like this infuriates me.

It’s exactly why we’re currently pushing for the ability to disable developer
tools, we want it added to Chrome and other browsers. I should be able to, as
a web site owner, not allow any kind of developer tool usage.

Users do not own our product and have no right to go poking around like this!

~~~
natzar
You are talking about websites, not native apps. Make a native app.

