
Show HN: Using ML to detect key changes to 2020 Candidates' Websites - bluepeter
https://deepdiveduck.com/campaignmonitor/
======
giancarlostoro
These look like diffs to me, which I'm okay with. Not sure where the ML comes
into play, and honestly, just calling it a site that shows "updates" or
"changes" to 2020 candidate sites is good enough.

~~~
bluepeter
Thanks... the bulk of the site is made up of diffs. We score them w/ our ML
model. We aren't (yet) surfacing those scores on a page-by-page basis. Rather,
we are simply collecting the stand-out examples detected by ML in the top card
as links to those specific diffs.

Our bad for not including the scores for each page... our thinking w/ this
content piece was that it would be attractive to journalists, who may not have
an appreciation for the underlying scores.

------
alexcnwy
Very cool idea.

It's not clear from the site where you're using ML - URLs seem chronological,
not ranked by "relevance/importance", and I can't see any relevance/importance
indicators.

I'm curious to hear more detail on how you're encoding your visual diff model.

BTW I ran into 2 issues:

1. You can't zoom out on your image diff slider.

2. I got this error when I returned to the site after closing it:
{"crossDomain": true, "method": "GET", "url":
"https://api.fluxguard.com/public/site/7f646558-f754-447a-b627-9b5202c8a1f2/page?limit=10&publicAccount=campaignmonitor"}
Please contact us if this error is happening frequently for you.

~~~
bluepeter
Thanks for the feedback... right now we are only using extracted text (and
some DOM data) for ML. We aren't using the images for any ML work because, as
you likely suspect, that's pretty hard to do in a meaningful way.
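
To make the extracted-text part concrete, the idea is roughly the sketch below
(illustrative Python only, not our actual pipeline; BeautifulSoup and the
helper names here are stand-ins):

    import difflib

    from bs4 import BeautifulSoup

    def visible_text(html):
        """Drop scripts/styles and return the page's visible text lines."""
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()
        return [ln.strip() for ln in soup.get_text("\n").splitlines() if ln.strip()]

    def text_diff(old_html, new_html):
        """Unified diff of the extracted text between two crawls of a page."""
        return list(difflib.unified_diff(
            visible_text(old_html), visible_text(new_html), lineterm=""))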

We aren't surfacing the per-page scores at the moment... most of the ML work
was done specifically for this content piece, so we haven't adjusted our core
presentation to include it (including the flag-grading system we use to build
the training model), other than simply listing stand-out examples at the top
of the site.

For our customers, we likely will need to build industry/use-specific models.
(Amazon is already sort of doing this by providing pharm-specific text
classifiers.) Use cases are so disparate at the moment that it's hard to build
a general model for everyone.

As for the cross-domain errors, grrr! Thanks, we will look into it... hard to
troubleshoot those. Our API stack consists of CloudFormation -> API Gateway ->
API Gateway Caching -> Lambda -> DynamoDB... ( _edit_ AKA who knows where
those errors are! haha)

(And more detail on the problem you're having w/ the image diff slider would
be appreciated... feel free to email us directly at the address at the bottom
of the site. Not sure I understand the issue you're having.)

( _edit_ feel free to email me w/ error details at peter (at) deepdiveduck .
com ... the cross-domain scripting errors are a constant thorn in our side, so
we'd like to get any other info you have on 'em)

~~~
darepublic
Are you sure the error is a CORS error? The data says crossDomain, but the
error object posted doesn't seem to actually contain much info on the cause,
unless crossDomain: true is supposed to tell you that.

~~~
bluepeter
Yeah, I am not sure it is a CORS error, though that wouldn't surprise me. This
looks to be an error from our API Gateway validation rules (which, at least
for us, are notoriously difficult to get to send any more error data to the
client)... that is to say, this will typically occur when one of the form or
query-string params is sending illegal data. I tried to repro last night w/ no
luck. This sort of error isn't in the error logs (that we typically
monitor)...

------
bluepeter
Hi folks: So we're monitoring most major 2020 Presidential Candidates' sites
for visual + HTML/DOM + network + extracted-text changes. (You can see all
detected changes at the above link.) There's a lot of noise! So we're using ML
to identify significant changes. (You can see these findings so far at the top
of the page.)

We've trained our model using detected changes from corporate sites and some
earlier political sites. Each change for our model was human-rated in terms of
relevance/importance, and we also feed in descriptive attributes about each
change, such as DOM location, immediate parent tag, and several others.
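
In rough strokes: featurize each change's diff text plus those descriptive
attributes, then fit against the human ratings. A minimal scikit-learn sketch
of that shape (illustrative only; the column names and toy rows are made up,
not our real features or data):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # One row per detected change: diffed text plus descriptive attributes,
    # with a human relevance rating as the label.
    changes = pd.DataFrame({
        "diff_text":  ["Donate by midnight!", "New healthcare policy page"],
        "parent_tag": ["div", "main"],
        "dom_depth":  [7, 3],
        "relevant":   [0, 1],  # human rating: 0 = noise, 1 = significant
    })

    features = ColumnTransformer([
        ("text", TfidfVectorizer(), "diff_text"),
        ("tag", OneHotEncoder(handle_unknown="ignore"), ["parent_tag"]),
    ], remainder="passthrough")  # numeric attrs (dom_depth) pass through

    model = Pipeline([("features", features), ("clf", LogisticRegression())])
    model.fit(changes.drop(columns="relevant"), changes["relevant"])
    # Scoring a new change: model.predict_proba(new_rows)[:, 1]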

~~~
social_quotient
Congrats on what you’ve made!

What sort of noise are you running into? (Curious)

In an old project we actually did something based more on visual changes. We
detected visual diffs and got the coordinates of the changed region, then
found the smallest HTML container that encompassed the diff and highlighted
it.

Using visual diffs you can fuzz things a bit to handle artifacts and small
movements.

A good lib we have some miles on:
[https://github.com/mapbox/pixelmatch](https://github.com/mapbox/pixelmatch)

And this write-up on niffy is pretty good:
[https://segment.com/blog/perceptual-diffing-with-niffy/](https://segment.com/blog/perceptual-diffing-with-niffy/)
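
The bounding-box step is only a few lines. A rough Pillow sketch of the idea
(a Python analogue of what pixelmatch gave us, not the code we actually ran;
assumes same-size screenshots):

    from PIL import Image, ImageChops, ImageDraw

    def changed_bbox(before_png, after_png):
        """Bounding box of changed pixels between two same-size screenshots."""
        before = Image.open(before_png).convert("RGB")
        after = Image.open(after_png).convert("RGB")
        return ImageChops.difference(before, after).getbbox()  # None if identical

    def highlight(after_png, bbox, out_png):
        """Draw a box around the changed region on the newer screenshot."""
        img = Image.open(after_png).convert("RGB")
        ImageDraw.Draw(img).rectangle(bbox, outline="red", width=3)
        img.save(out_png)
        # From here you map the bbox back to the smallest enclosing HTML
        # container (e.g. document.elementFromPoint in the browser).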

~~~
bluepeter
In terms of overall noise, we're running into a lot. (This is what led us to
ML as a way to hopefully reduce it.)

Comparing the pure DOM reveals almost constant change due to various inserted
Javascript from Google/Facebook/etc. modifying the DOM. Looking at text/images
also results in a fair amount of expected noise, but it's mostly in the form
of interstitial marketing banners, fundraising targets, etc.

We already have various options to "filter" out certain DOM areas. (As an
example, remove all footers, headers, or anything else matched by a CSS
selector.) These work really well... but they require a fair bit of setup.
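
The filtering itself is conceptually simple... something like this sketch
(illustrative BeautifulSoup, not our implementation; the selectors are made-up
examples):

    from bs4 import BeautifulSoup

    # Per-site selectors a user has flagged as known noise (examples only).
    IGNORED_SELECTORS = ["header", "footer", ".fundraising-banner"]

    def strip_ignored(html):
        """Remove configured noisy regions before diffing DOM/text."""
        soup = BeautifulSoup(html, "html.parser")
        for selector in IGNORED_SELECTORS:
            for node in soup.select(selector):
                node.decompose()
        return str(soup)

The setup cost is mostly in curating those selectors per site.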

Thanks for the pixelmatch GH repo link... I have it starred, so I must have
taken a look in the past. We need to evolve our image diff, so we may end up
using this!

