
Show HN: Web scraping page analyzer - jardah
https://www.apify.com/page-analyzer
======
tcmb
I entered a URL and pressed enter, wondering why nothing happened. Only then
did I scroll down to find the 'Analyze' button. I wasn't looking for specific
attributes, and the strong color contrast of that section made it look like
nothing else of interest would come below.

~~~
jardah
Oh... Clearly I need to work on my UX skills. I will improve that in the next
iteration.

------
jardah
Just a quick update: thank you for using it and playing around with it.
Looking at the usage and results, I found quite a lot of things to improve.
Which is great, since it's hard to develop something like this without real
usage data.

------
at_smith
Awesome tool! How do you handle scraping data that's hiding behind layers of
~fancy~ JS libraries? Is it as simple as triggering click events, pausing for
loading, and then grabbing the information?

~~~
jardah
This tool basically performs the simplest data loading: it opens the webpage,
waits until most XHR requests are done, waits another second (to give JS time
to manipulate the DOM), and then loads data from the page. This way, it has
what the user sees when they open the page in a browser. So if the data is
visible, loaded through XHR, or hidden in a global JS variable, it will see it.
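
The "wait until most XHR requests are done" step boils down to detecting a
quiet period on the network. A minimal sketch of that logic as a pure function
(the name and the 500 ms default are illustrative, not the analyzer's actual
code):

```javascript
// Given [start, end] intervals (ms since page load) for each request, find
// the first moment the network has been quiet for `quietMs` milliseconds.
function firstQuietMoment(requests, quietMs = 500) {
  // Sort by start time and merge overlapping busy intervals.
  const busy = [...requests].sort((a, b) => a[0] - b[0]);
  const merged = [];
  for (const [s, e] of busy) {
    const last = merged[merged.length - 1];
    if (last && s <= last[1]) last[1] = Math.max(last[1], e);
    else merged.push([s, e]);
  }
  // Scan the gaps between busy intervals for the first one of length quietMs.
  let cursor = 0; // start of the current quiet stretch
  for (const [s, e] of merged) {
    if (s - cursor >= quietMs) return cursor + quietMs;
    cursor = e;
  }
  return cursor + quietMs;
}
```

Headless-browser libraries ship similar heuristics out of the box (e.g.
Puppeteer's `networkidle0` wait condition), so in practice you would rarely
hand-roll this.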

For more advanced usage (like clicking, or submitting a search request) it
would need some kind of scenario like: "Click on this" -> "wait till this
loads" -> "type something here" -> "scroll to this" -> load data.

Which is possible with headless Chrome, so the trick is to make it general and
easy to use (something like recording what the user does through a Chrome
plugin). Maybe in future versions :)
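
One way to picture the "click -> wait -> type -> scroll" idea is a recorded
scenario: a plain list of steps interpreted against a headless-browser page
object. A hypothetical sketch (the step vocabulary and `runScenario` name are
made up; the `page` surface mirrors Puppeteer's):

```javascript
// Interpret a recorded scenario (an array of step objects) against a page
// object with a Puppeteer-like surface: page.click, page.type, etc.
async function runScenario(page, steps) {
  for (const step of steps) {
    switch (step.action) {
      case 'click':   await page.click(step.selector); break;
      case 'type':    await page.type(step.selector, step.text); break;
      case 'waitFor': await page.waitForSelector(step.selector); break;
      case 'scroll':
        await page.evaluate(
          sel => document.querySelector(sel).scrollIntoView(),
          step.selector);
        break;
      default: throw new Error(`unknown action: ${step.action}`);
    }
  }
}

// e.g. a recorded "search, then wait for results" scenario:
const scenario = [
  { action: 'click',   selector: '#search' },
  { action: 'type',    selector: '#search', text: 'shoes' },
  { action: 'waitFor', selector: '.results' },
];
```

A Chrome-plugin recorder would then just emit such a step list from what the
user does, and the scraper replays it.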

~~~
cseelus
Could be an interesting enhancement. Sounds a little bit like what Capybara, a
test framework for Ruby apps, can do [1], things like

    click_link('Link Text')
    fill_in('Password', with: 'Seekrit')
    choose('A Radio Button')
    check('A Checkbox')
    uncheck('Another Checkbox')
    select('Option', from: 'Select Box')

[1]
[https://github.com/teamcapybara/capybara#navigating](https://github.com/teamcapybara/capybara#navigating)

------
nreece
Cool tool!

* _shameless plug_ * Our little startup, Feedity - [https://feedity.com](https://feedity.com), helps create custom RSS feeds for any webpage, utilizing Chrome for full rendering and many other tweaks & techniques under the hood for seamless & scalable indexing.

------
cstrat
Looks awesome! Does the tool work when trying to access websites behind web
application firewalls? eg. F5 WAF [1]

[https://f5.com/glossary/web-application-
firewall](https://f5.com/glossary/web-application-firewall)

~~~
jardah
Depends on whether we access the website from a proxy that is known to the
WAF. But for most websites it's just a single normal request. If it becomes an
issue in the future, we could make a browser extension that does the analysis
on a page loaded by the user, so that we don't have to use a proxy to connect
to it. If you are talking about actually scraping the websites, then that is
usually handled case by case. Mostly it works, but sometimes it's a bit harder
to get around.

------
guilamu
Not giving me anything useful on this pretty straightforward table:

[http://www.dsden93.ac-
creteil.fr/spip/spip.php?page=ecoles](http://www.dsden93.ac-
creteil.fr/spip/spip.php?page=ecoles)

~~~
jardah
Yes, that is probably the problem; when I looked for the text it returned:

    [{
      "selector": ".bloc-blanc > p:nth-child(1)",
      "text": " 0 école(s) correspondent à votre recherche "
    }]

~~~
jardah
Aha! I see: it shows data based on a POST request from a form on this page
[http://www.dsden93.ac-
creteil.fr/spip/spip.php?page=annu1d](http://www.dsden93.ac-
creteil.fr/spip/spip.php?page=annu1d), so if you provide just a link to the
results page without the POST data, it will show you nothing. Sadly, the tool
currently does not allow sending POST requests to the websites.

~~~
guilamu
Thanks for your replies. I've successfully been parsing this page with other
parsers, though.

Edit: the page changed and it's not working anymore. Sorry for the false
alarm, my bad.

------
Kikobeats
Similar but just for getting normalized metadata:
[https://microlink.io](https://microlink.io)

------
jardah
I'm still testing and improving it (there are so many different websites with
different responses...), so if you have any comments, I'm looking forward to
hearing what you think about it.

~~~
JustARandomGuy
Suppose I wanted to extract an image that gets loaded async via JavaScript
(for example, a Pinterest page). How would that work? Looking at your
documentation, it looks like I could parse the XHR array you supply. Could you
suggest any other ways? I'm calling out Pinterest as an example here because
they try to block their images from being easily downloaded, but if you have
any other examples I'd like to hear them.

It would be great if the page analyzer could supply a list of all the assets
loaded with the web page; for example, any asset with a media type of image/*
is listed in an images array, and so forth.

~~~
jardah
Actually, the list of assets shouldn't be that hard. Looking at Pinterest, the
XHR requests for images fire immediately when the page opens, so potentially
they're caught in the onRequest function (right now I'm aborting those
requests to save network traffic). I will try it out tomorrow and let you know
in a comment.

Also, looking at Pinterest, it's server-rendered through ReactJS, so there is
an #initial-state script tag with the first few images preloaded as URLs. If
you only care about the images at the top, without scrolling, then this is the
safest bet.
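
Once those requests are no longer aborted, the asset-list suggestion mostly
needs a grouping step: collect each response's URL and Content-Type, then
bucket by MIME family. A small sketch of that grouping (the input shape is
illustrative, e.g. what you could collect from a headless browser's response
events):

```javascript
// Group finished responses by MIME family: image/*, application/*, text/*, ...
// Each response is { url, contentType }; unknown types land in "other".
function groupAssets(responses) {
  const groups = {};
  for (const { url, contentType } of responses) {
    const family = (contentType || '').split('/')[0] || 'other';
    (groups[family] = groups[family] || []).push(url);
  }
  return groups;
}
```

The output's `image` bucket is then exactly the "images array" asked for
above.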

------
chadlavi
We use this for some stuff at my office. It's handy.

------
dmarlow
Is data shared between accounts if two accounts both want to retrieve
information from the same exact URL?

~~~
jardah
Nope, there is no caching right now: every run of the tool has a single
instance and writes the output into a separate file. I'm using it to test the
stability of the cloud when multiple users are using it, and to test proxies.
It would not be much of a test if one user opened the demo page and then every
other user just got the results from a file. But when I'm happy with how it
works, I will add caching.

------
oevi
Nice work! It seems that it only supports microdata and not RDFa at the
moment?

~~~
jardah
Yep, only microdata. I completely forgot about RDFa; I'm adding it to my todo
list right away. It would be a great addition.
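
For what it's worth, the two vocabularies are easy to tell apart in the DOM:
microdata hangs off `itemscope`/`itemtype`/`itemprop` attributes, while RDFa
uses `vocab`/`typeof`/`property` (and friends). A hypothetical helper that
classifies an element by its attribute names:

```javascript
// Classify an element's semantic-markup flavor from its attribute names.
// Attribute lists are the core ones from each spec; names here are made up.
const MICRODATA_ATTRS = new Set(['itemscope', 'itemtype', 'itemprop']);
const RDFA_ATTRS = new Set(['vocab', 'typeof', 'property', 'resource', 'about']);

function markupFlavor(attrNames) {
  if (attrNames.some(a => MICRODATA_ATTRS.has(a))) return 'microdata';
  if (attrNames.some(a => RDFA_ATTRS.has(a))) return 'rdfa';
  return 'none';
}
```

So an RDFa pass could reuse most of the microdata extractor's structure and
just walk a different attribute set.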

~~~
anomie31
Speaking of which, do you think you could support more ontologies than
schema.org? It's easy to use schema.org without understanding the rest of the
RDF ecosystem, so I'll elaborate in a minute, but I'm on my phone right now so
it's difficult.

------
BrandoElFollito
Is there a way to access an authenticated web site?

~~~
jardah
Sadly, not for now. Our company has a solution for that (for some websites),
but currently this tool does not have that functionality, since I wanted it to
be as simple as possible. Maybe in the future.

~~~
lanewinfield
Authentication support on this would make it an instant purchase for me.

~~~
jardah
Some general authentication (like separate input fields for your login
credentials on the website) could potentially be done (but it would be very
unsafe for the user of the tool, since you would be sending us your
credentials in plaintext). But authentication as a whole is sadly not as
uniform as semantic data on the web. Not every website has the same login form
(different fields); some use captchas, some use authenticators, some do robot
checks for too-fast logins.

------
nickthemagicman
Just wanted to say great tool!

~~~
jardah
Thank you! There are still lots of things to improve (for example, 404
handling), but it's great to see positive feedback.

------
petagonoral
Hmm, no joy getting rating information from an amazon.com product page.

~~~
jardah
Amazon unfortunately doesn't use any metadata markup for reviews (probably to
prevent easy scraping by competing companies). You can only get them from the
HTML (at least from what I can see).

------
alexroan
Love this.

