
Monitor web page changes with Go - ssimunic
http://silviosimunic.com/blog/monitor-web-page-changes-with-go/
======
Xeoncross
When making HTTP requests with Go (or any language) you should at least
consider the following:

\- Setting a User Agent (req.Header.Set(Key, Value))

\- Setting Timeouts ([https://blog.cloudflare.com/the-complete-guide-to-golang-net...](https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/))

\- Avoiding ioutil.ReadAll() and checking body length (or using
[https://golang.org/pkg/io/#LimitReader](https://golang.org/pkg/io/#LimitReader))

\- Caching DNS lookups ([https://github.com/viki-org/dnscache](https://github.com/viki-org/dnscache))
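
The first three points can be sketched roughly as follows. This is a minimal sketch, not the article's code: the names `fetch` and `demoFetch`, the User-Agent string, and the timeout and limit values are all assumptions chosen for illustration. `demoFetch` uses a local `httptest` server so the sketch runs offline.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetch GETs url with a User-Agent header and reads at most limit bytes.
// The client is expected to carry a Timeout so a stalled server cannot hang the caller.
func fetch(client *http.Client, url, userAgent string, limit int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	// Reject early when the server declares an oversized body.
	if resp.ContentLength > limit {
		return nil, fmt.Errorf("body too large: %d bytes", resp.ContentLength)
	}
	// Read limit+1 bytes so an oversized body is detected instead of silently truncated.
	body, err := io.ReadAll(io.LimitReader(resp.Body, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(body)) > limit {
		return nil, fmt.Errorf("body exceeds %d bytes", limit)
	}
	return body, nil
}

// demoFetch runs fetch against a local test server that echoes the User-Agent it received.
func demoFetch(limit int64) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, r.UserAgent())
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second} // assumed timeout value
	body, err := fetch(client, srv.URL, "page-monitor-example/0.1", limit)
	return string(body), err
}
```

Reading through `io.LimitReader` keeps memory bounded even when the server sends no Content-Length or lies about it; the explicit length check only saves the read in the honest case.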

~~~
spraak
Do you mean

> Avoiding ioutil.ReadAll

AND

> checking body length

or do you mean,

> Avoid ioutil.ReadAll()

> and [instead you should be] checking the body length

?

Do you have an example of that, or an example of LimitReader?

~~~
piotrkubisa
You made me curious, so I've prepared a simple benchmark -
[https://gist.github.com/piotrkubisa/8ccc308086378da7d2f6c0d5...](https://gist.github.com/piotrkubisa/8ccc308086378da7d2f6c0d550aded33)

~~~
spraak
That's super awesome, thank you

------
nemo1618
>we will need to define two global variables

Or just pass them to the request functions. And why define the global
LOOP_EVERY_SECONDS in terms of seconds, rather than a time.Duration?

This code also doesn't handle errors at all. And it can exit any time from
within checkChanges. And it compares data by converting the []byte to a
string. Use bytes.Equal instead.
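
A rough sketch of what these suggestions might look like together, assuming a hypothetical `watch` function (not the article's `checkChanges`): the interval is a `time.Duration` parameter instead of a global, errors are returned to the caller instead of exiting, and the comparison uses `bytes.Equal`.

```go
package main

import (
	"bytes"
	"time"
)

// watch takes a baseline with fetch, then re-fetches every interval.
// It returns true as soon as the content differs from the baseline,
// false after maxChecks unchanged fetches, and any fetch error as-is.
func watch(fetch func() ([]byte, error), interval time.Duration, maxChecks int) (bool, error) {
	prev, err := fetch()
	if err != nil {
		return false, err
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for i := 0; i < maxChecks; i++ {
		<-ticker.C
		cur, err := fetch()
		if err != nil {
			return false, err // let the caller decide whether to retry or alert
		}
		if !bytes.Equal(prev, cur) { // no []byte -> string conversion needed
			return true, nil
		}
	}
	return false, nil
}
```

A caller could pass something like `30 * time.Second` as the interval, which reads more clearly than a bare integer of seconds.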

Finally, wouldn't this be a bit more useful if it printed a diff of the old-vs-new?
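
As a very naive illustration of the diff idea, the sketch below reports which lines disappeared and which appeared. It is a multiset comparison of lines, not a real positional diff (moved lines are not reported), and the names `lineDiff` and `unmatched` are made up for this example.

```go
package main

import "strings"

// lineDiff returns the lines present only in oldText (removed)
// and the lines present only in newText (added), in document order.
func lineDiff(oldText, newText string) (removed, added []string) {
	oldLines := strings.Split(oldText, "\n")
	newLines := strings.Split(newText, "\n")

	removed = unmatched(oldLines, newLines)
	added = unmatched(newLines, oldLines)
	return removed, added
}

// unmatched returns the lines of a that have no remaining counterpart in b.
// Duplicate lines are matched one-for-one.
func unmatched(a, b []string) []string {
	remaining := make(map[string]int, len(b))
	for _, line := range b {
		remaining[line]++
	}
	var out []string
	for _, line := range a {
		if remaining[line] > 0 {
			remaining[line]--
			continue
		}
		out = append(out, line)
	}
	return out
}
```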

~~~
pimlottc
Handling errors is 90% of the job for a page monitor like this. Have fun
getting spammed when the site goes offline briefly, or when your internet
connection goes down.
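
One way to avoid that kind of spam, sketched under assumptions: only alert after several consecutive failures, and only once per outage. The `alerter` type, its `threshold` field, and `errFetch` are hypothetical names for this example.

```go
package main

import "errors"

// errFetch stands in for whatever error a real fetch would return.
var errFetch = errors.New("fetch failed")

// alerter suppresses alerts for brief outages.
type alerter struct {
	threshold int  // consecutive failures required before alerting
	failures  int  // current run of consecutive failures
	alerted   bool // whether this outage has already been reported
}

// record takes the result of one check and reports whether an alert should be sent now.
func (a *alerter) record(err error) bool {
	if err == nil {
		// Any success ends the outage and re-arms the alert.
		a.failures = 0
		a.alerted = false
		return false
	}
	a.failures++
	if a.failures >= a.threshold && !a.alerted {
		a.alerted = true
		return true
	}
	return false
}
```

With `threshold: 3`, a single dropped request produces no alert, and a long outage produces exactly one.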

~~~
ssimunic
Good point. Added error checking.

------
jasode
This is an interesting submission for educational purposes but I want to warn
folks who might be inspired to "roll their own" website change monitoring
tool. I started writing a similar tool in C# and since I was hoping to create
a _generalized_ tool for any website, I had to deal with lots of tricky edge
cases. (E.g. raw HTML that changes vs visible text that doesn't, copyright
dates on footers that get changed but the real meat of the page is still
exactly the same triggering false alarms, etc.)

I gave up and simply used Aignes Website-Watcher.[1] If you're still itching
to program your own webpage change detector, I recommend reading through their
16 years of release history[2] to get an idea of the various bugs they had to
solve. The scope of edge cases can be overwhelming if you want to write
something very robust.

[1] [http://aignes.com/index.htm](http://aignes.com/index.htm)

[2] [http://aignes.com/wsw_history.htm](http://aignes.com/wsw_history.htm)

~~~
Xeoncross
+1 Read the release logs of existing projects. I wish more projects actually
took the time to talk about the problems they solved since those issues
usually outlive their projects as new alternatives pop up.

------
dbg31415
How would one use this with pages that have dynamic ads on them?

Every production site should have content (at least for any page that gets
more than 5% total traffic), uptime / load time, application performance,
server health, and security monitoring.

Can't count the number of times content monitoring bailed me out... a CMS
change went live too early, or someone released the wrong branch to prod, or a
DNS subscription I didn't control expired... I usually just string-check the
Title or H1 on the page but would be great to have more advanced tools.
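
A string check on the title could look roughly like this in Go. It is a crude sketch using a regular expression from the standard library; a real HTML parser such as golang.org/x/net/html would be more robust. The function names `pageTitle` and `titleOK` are assumptions.

```go
package main

import (
	"html"
	"regexp"
	"strings"
)

// titleRe matches the first <title> element, case-insensitively and across newlines.
var titleRe = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)

// pageTitle extracts the trimmed, entity-decoded title text, if a title is present.
func pageTitle(page string) (string, bool) {
	m := titleRe.FindStringSubmatch(page)
	if m == nil {
		return "", false
	}
	return strings.TrimSpace(html.UnescapeString(m[1])), true
}

// titleOK reports whether the page has exactly the expected title.
func titleOK(page, want string) bool {
	got, ok := pageTitle(page)
	return ok && got == want
}
```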

------
plg
I have always wanted a program/script to check for changes in the NIH eRA
Commons site, specifically a page that indicates for a particular grant
application, what its score is. Problem is, to get to this magic page (which
is, I think, dynamically created), you have to:

(1) log into the main eRA Commons site with a username and pw

(2) then a new page opens up, on which you have to click on a particular link
to get a list of your grants

(3) then click on the grant of interest

(4) then click on the competition date

(5) then the magic page opens up. It's this page I want to check for changes

I've done the requisite Google searching and found lots of generic
suggestions about cookies, etc, but I don't have the skills, obviously.

Is there a tool whereby I can click "record" and go through my various steps,
and then the tool will keep track of the programmatic aspects of this
surf-and-click dance, such that I can execute that script later (e.g. as a cron
job)?

Thanks!

~~~
abhirag
Another option could be to use mechanize
([https://pypi.python.org/pypi/mechanize](https://pypi.python.org/pypi/mechanize)).
Originally written in Perl, then ported to Python as well as Ruby.

~~~
plg
Got it done with Python/Mechanize! Thanks for the suggestion!
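
For anyone wanting to stay in Go, the same log-in-then-fetch flow can be sketched with the standard library's cookie jar. Everything site-specific here is an assumption: the `/login` and `/score` paths, the form field names, and the credentials are invented, and the real site's login flow may need extra steps (hidden form fields, redirects, and so on). `demoLogin` runs against a local `httptest` server.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
	"net/url"
	"time"
)

// fetchBehindLogin posts credentials to loginURL, then GETs pageURL.
// The cookie jar carries the session cookie from the first request to the second.
func fetchBehindLogin(loginURL, pageURL, user, pass string) (string, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return "", err
	}
	client := &http.Client{Jar: jar, Timeout: 10 * time.Second}

	// Field names "username" and "password" are hypothetical.
	resp, err := client.PostForm(loginURL, url.Values{"username": {user}, "password": {pass}})
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("login failed: %s", resp.Status)
	}

	resp, err = client.Get(pageURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("page fetch failed: %s", resp.Status)
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	return string(body), err
}

// demoLogin exercises fetchBehindLogin against a local stand-in for the real site.
func demoLogin(pass string) (string, error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
		if r.FormValue("username") != "alice" || r.FormValue("password") != "secret" {
			http.Error(w, "bad credentials", http.StatusUnauthorized)
			return
		}
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "ok", Path: "/"})
	})
	mux.HandleFunc("/score", func(w http.ResponseWriter, r *http.Request) {
		if c, err := r.Cookie("session"); err != nil || c.Value != "ok" {
			http.Error(w, "not logged in", http.StatusForbidden)
			return
		}
		io.WriteString(w, "score: pending")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	return fetchBehindLogin(srv.URL+"/login", srv.URL+"/score", "alice", pass)
}
```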

------
jhoechtl
A cryptographic hash is overkill; murmur or cityhash would be much more
appropriate. Furthermore, when using one of the mentioned hashes it might make
sense to tolerate a minimum of change, like a timestamp which has changed on
the site, but that of course depends on the use case.
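
A minimal sketch of the non-cryptographic route. Murmur and cityhash are not in Go's standard library, so this swaps in FNV-1a from `hash/fnv` as a stand-in; the function name `fingerprint` is an assumption.

```go
package main

import "hash/fnv"

// fingerprint returns a 64-bit FNV-1a hash of the page body.
// It is fast and fine for change detection, but not collision-resistant
// against an adversary, so it should not be used for integrity checks.
func fingerprint(body []byte) uint64 {
	h := fnv.New64a()
	h.Write(body) // Write on a hash.Hash never returns an error
	return h.Sum64()
}
```

Storing the previous `uint64` and comparing it with the new one replaces keeping the whole previous body in memory.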

------
zlagen
This will break on a lot of pages, as the HTML will be different on each
request on most pages although the content will be the same. Perhaps it would
be better to compare visible text rather than the raw HTML.
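
A crude approximation of that idea with only the standard library: drop comments, script and style blocks, and tags, then normalize whitespace before comparing. This is a sketch, not a renderer; it knows nothing about CSS-hidden elements or JavaScript-generated content, and it can split words that are interrupted by inline tags. The name `visibleText` is an assumption.

```go
package main

import (
	"html"
	"regexp"
	"strings"
)

var (
	commentRe = regexp.MustCompile(`(?s)<!--.*?-->`)
	hiddenRe  = regexp.MustCompile(`(?is)<(script|style)\b.*?</(script|style)\s*>`)
	tagRe     = regexp.MustCompile(`(?s)<[^>]*>`)
)

// visibleText returns an approximation of the text a reader would see.
func visibleText(page string) string {
	s := commentRe.ReplaceAllString(page, " ")
	s = hiddenRe.ReplaceAllString(s, " ")
	s = tagRe.ReplaceAllString(s, " ")
	s = html.UnescapeString(s)
	// Collapse all runs of whitespace so layout-only changes do not count.
	return strings.Join(strings.Fields(s), " ")
}
```

Two fetches that differ only in attributes or inline scripts then compare equal, while a changed price still shows up.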

~~~
arjie
Neat. Do you have a recommendation for a headless browser library that would
simplify this? The difference between checking the raw HTML and visible HTML
can be pretty big.

~~~
altavox
In client-side JavaScript, document.documentElement.innerText yields only the
rendered text, without markup, CSS or JavaScript source.

~~~
arjie
But the theory was that the source would be different, and only the rendered
content would be the same at the same place, right? How would you find the
'same place' without fully rendering the page?

~~~
tedmiston
One way is for the source to use cache control headers so you can just do a
quick HEAD instead of a GET on the whole contents.
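
Roughly, in Go, that could mean issuing a HEAD request and comparing a validator header between checks. This sketch assumes the server actually sends `ETag` or `Last-Modified`, which many do not; the names `validator` and `demoValidators` are made up, and the demo server deliberately changes its ETag on every request.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// validator issues a HEAD request and returns the ETag,
// or failing that the Last-Modified value, for the resource.
func validator(client *http.Client, url string) (string, error) {
	resp, err := client.Head(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	if etag := resp.Header.Get("ETag"); etag != "" {
		return etag, nil
	}
	if lm := resp.Header.Get("Last-Modified"); lm != "" {
		return lm, nil
	}
	return "", fmt.Errorf("no ETag or Last-Modified header; fall back to GET")
}

// demoValidators queries a local test server twice; the server simulates
// a page that changes between requests by bumping its ETag each time.
func demoValidators() (first, second string, err error) {
	var hits int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		n := atomic.AddInt32(&hits, 1)
		w.Header().Set("ETag", fmt.Sprintf(`"v%d"`, n))
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	if first, err = validator(client, srv.URL); err != nil {
		return "", "", err
	}
	if second, err = validator(client, srv.URL); err != nil {
		return "", "", err
	}
	return first, second, nil
}
```

When the validator is unchanged, the full GET can be skipped; when it differs or is missing, the monitor falls back to downloading and comparing the body.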

------
AdamSC1
Does anyone have insights on how to refine this to only monitor specific
elements of the page rather than the entire page?

I may not care if they change a template but would care if they change a price
or a warrant canary etc?

~~~
wlkr
I suspect you'd want to search for a specific attribute or tag combination, I
used the scrape package [0] recently and found it a pleasure to use. Beyond
the example provided in the README the documentation [1] is also good.

[0] [https://github.com/yhat/scrape](https://github.com/yhat/scrape)

[1]
[https://godoc.org/github.com/yhat/scrape](https://godoc.org/github.com/yhat/scrape)

------
spraak
Tangent: why is the Go mascot always so deformed[1]? It makes me sad because I
do love Go but having such an awkward mascot makes me feel like the language
is deformed, too (which some people would say is true, I admit)

[1] Yes, I'm familiar with the history of its creation, I just wish it weren't
so

~~~
artpar
Do you judge a product from its logo/mascot?

~~~
grzm
And books by their covers!

Your parent clearly mentions it's a tangent, and likely not a big deal. I'll
go a bit further and admit that the choice of logo or mascot does indicate
something about the project, whether it be a sense of style or respect for
tradition or history. It's by no means the most important criterion, but it's a
datum nonetheless.

Language creator hairstyles are also important signals, by the way.

~~~
spraak
Thanks for understanding. I think that "don't judge a book by its cover" isn't
an absolute maxim to live by. In fact, yes, I do judge a book by its cover.
Not ENTIRELY by its cover, but as you point out, it's not irrelevant.

------
ssimunic
Just a note and reply to the comments made so far:

I agree that there are flaws with this as zlagen said (where the page is
different in some other parts of HTML).

I made this for my own need, which was to check a page for a single number
changing while everything else stayed the same. It is meant to demonstrate the
general idea behind making something like this.

nemo1618, I agree that code should handle errors and you should definitely do
that. For printing out old vs new version, as I said, for my needs, that
wasn't needed since I just wanted to get emailed once one number on page
changes.

