
Show HN: Checkbot for Chrome – web crawler that tests for web best practices - seanwilson
https://www.checkbot.io
======
gerdesj
Very cool. I'm giving it a quick go whilst doing yet another patchathon on
customer systems.

First impressions are that it is very quick and gives some great advice. I'm
not really a web dev, but even I can see how this can make a good audit tool.
Looks great as well.

I suggest caution against using the term "best practice" though. It's one of
my pet hates - there is good practice and there is bad practice, but it's a
brave person who claims to know best practice. I think my hatred of that term
stems from seeing it plastered all over older MS docs and the usual crop of
"me too" copy-and-paste blog postings that litter the web, not to mention
various forum postings. We're all bloody experts who know best in this game 8)

~~~
keithnz
I agree, it runs nicely, and is nicely presented. The term best practice does
irk me. But weirdly enough I think the use of that term is why I decided to
give it a go.

~~~
acct1771
Love-hate relationship with marketing is probably pretty common in this crowd!

------
ratata
Nice. Very similar to Lighthouse
[https://developers.google.com/web/tools/lighthouse/#devtools](https://developers.google.com/web/tools/lighthouse/#devtools)

~~~
wgjordan
Similar feature-set at first glance, but not open-source, not free after beta
ends, and created/maintained by an unknown solo developer. Nice site design,
but I think the cards are pretty heavily stacked against this one.

~~~
seanwilson
The big difference is Checkbot crawls whole websites as opposed to checking a
single page at a time. The Checkbot interface is designed around helping you
hunt down issues that impact groups of pages and pages you didn't think to
check. For example, this lets you find duplicate title/description/content
issues and root out pages with broken links and invalid HTML you don't look at
often.

~~~
wgjordan
That does sound like a key differentiating feature, thanks for clarifying.
While I'd probably prefer to hook up an open-source web-crawler to lighthouse
(e.g., something like github/lightcrawler [1]), I could see SEO/marketing
experts in particular paying for a user-friendly all-in-one solution like this
versus cobbling something together from open-source tools.

[1]
[https://github.com/github/lightcrawler](https://github.com/github/lightcrawler)
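
For anyone curious, driving Lighthouse programmatically over a crawled list of
URLs looks roughly like this (a sketch based on the Lighthouse Node module
docs; exact option names may have changed, and the crawler would supply the
URL list):

    const lighthouse = require('lighthouse');
    const chromeLauncher = require('chrome-launcher');

    // Sketch: audit every crawled URL with one shared headless Chrome instance.
    async function auditAll(urls) {
      const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless'] });
      for (const url of urls) {
        const { lhr } = await lighthouse(url, { port: chrome.port, output: 'json' });
        console.log(url, lhr.categories.performance.score);
      }
      await chrome.kill();
    }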

~~~
seanwilson
Yes, so I think the UI is a really important factor here in terms of
productivity and ease of use. For example, after Checkbot has scanned your
localhost/development site and identified that a page has the same title as
other pages, you can edit your site, hit the "recrawl" button for that page
and confirm your fix worked in a few seconds. Users I've worked with so far
have really appreciated this fast and simple workflow.

------
queezey
Very cool.

I noticed that it's mangling some of my URLs, though.

`/!0ead1aEq` is getting turned into `/%210ead1aEq` (the exclamation point is
getting percent-encoded), which leads to a bunch of spurious 404 errors.

[https://tools.ietf.org/html/rfc3986#section-3.3](https://tools.ietf.org/html/rfc3986#section-3.3)
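
For illustration, "!" is a sub-delim that's allowed unencoded in a path, and
the standard JS routines preserve it; the legacy escape() function is one
thing that does produce the %21 mangling (no idea what Checkbot uses
internally, this is just a guess):

    // In a browser console: "!" is legal unencoded in a path (RFC 3986 section 3.3)
    new URL('/!0ead1aEq', 'https://example.com').pathname  // "/!0ead1aEq"
    encodeURI('/!0ead1aEq')                                 // "/!0ead1aEq"
    escape('/!0ead1aEq')                                    // "/%210ead1aEq" (reproduces the bug)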

~~~
seanwilson
Ah, thanks, that's a good bug report! I'll get this fixed.

------
seanwilson
Hi, I didn't get much feedback when I posted last time so I'm giving it
another try. This is aimed at helping web developers follow SEO, performance
and security best practices so I'd love to know what the community thinks. Can
you think of any changes that would make Checkbot more helpful? Did you notice
any bugs? Thanks!

~~~
cdawg0
Cool, I'm checking it out right now. Can you share what your tool provides
that others don't?

~~~
seanwilson
The major one is that instead of manually checking pages one at a time,
Checkbot lets you easily test thousands of pages in a few minutes to root out
issues you'd normally miss. As you're doing web crawls from your own machine,
you can also crawl any site you want, as often as you want, including
localhost/development, staging and production sites. This lets you identify
issues early and confirm fixes during development before problems go live.

~~~
cdawg0
Interesting. So the intent is to tackle problems before deployment. Do you
plan on any devtools integrations so it can be used as part of an automatic
CI/CD process? Also, does Checkbot dig into all dependencies or skip them like
some others?

~~~
seanwilson
> Interesting. So the intent is to tackle problems before deployment. Do you
> plan on any devtools integrations so it can be used as part of an automatic
> CI/CD process?

A lot of the time, website owners don't know there's a problem until their
search results or Google Search Console updates. So I'm seeing it being used
by developers to check localhost/development sites, then on staging for other
problems, then on production when changes are made there. A command-line
version to support CI/CD is something I'm really interested in as well.

> Also, does Checkbot dig into all dependencies or skip them like some others?

Can you expand on what you mean here?

------
mkorsak
Seems really cool, I like this. One issue I'm having so far though is that
after crawling my site it found one 404, and the link is
(my domain)/page-not-found-test. It also says there are 0 inlinks to it, so I
have no idea where it's getting the idea for this page from. It doesn't exist,
and I've never linked to anything like it.

------
mahesh_rm
I quickly tested it. It's staying in my Chrome. Good job.

~~~
seanwilson
Thanks, let me know how you get on!

------
santoshmaharshi
Very nicely thought through and implemented. I agree with you on the best
practice point. Already forwarded it to a few folks, and it's going to stay in
my browser.

~~~
seanwilson
Thanks! Let me know if you can think of any improvements I can make.

------
Raphmedia
Works pretty well. Incidentally, I just locked myself out of my own server.
This doubles as a security tester! :o)

~~~
seanwilson
On the left sidebar at the start you can modify the number of URLs crawled per
second if that helps!

------
chasers
It's nice and fast! Are you actually using chrome to render each page or
making requests some other way?

~~~
seanwilson
Thanks! All requests are done from your own browser if that's what you mean.

~~~
chasers
No, I mean like render the whole page with JS and all, but from your FAQ it
seems like you're not. Which is why it's fast, ha.

------
rambojazz
Is there a version that works on Firefox?

------
wpasc
Awesome

------
scrollaway
Really cool stuff. Here's some initial feedback:

"Avoid internal link redirects" -> All the errors I'm getting on my site are
due to the login-wall on some of the pages because it's detecting
/account/login/?next=... links as internal redirects.

"Use unique titles" / "Set page descriptions" / "Avoid URL parameters" /
"Avoid thin content pages" / etc -> Same problem as above with login walls
("Sign in to ..."). I get why, but it's adding a ton of noise. I added
/account/login to URLs to ignore but it didn't achieve anything; I'm guessing
I must have misunderstood the syntax, or it has to be an exact match or some
such?

"Avoid inline JavaScript" -> I'm making quite a bit of use of the <script
type="application/json" id="foo"/> pattern, which allows me to declare json
objects in the body I can later parse in my scripts. This pattern doesn't have
all the issues carried with inline js. Can you ignore script tags where the
type= is unknown or application/json\ __?
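
Roughly what I mean, in case it's not clear (simplified example):

    <!-- Inert data carrier: browsers don't execute non-JS script types -->
    <script type="application/json" id="foo">
      {"count": 3, "items": ["a", "b", "c"]}
    </script>
    <script>
      var data = JSON.parse(document.getElementById("foo").textContent);
    </script>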

HSTS preload: This is picking up individual pages on the checked site as
errors, even though HSTS preload really is a domain-wide thing.
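
For reference, the whole policy is a single response header served for the
registered domain (per hstspreload.org's requirements), along the lines of:

    Strict-Transport-Security: max-age=31536000; includeSubDomains; preload

so flagging individual pages is mostly noise.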

"Hide server version data" -> This is picking up "server: cloudflare" as an
error. Means no site behind cloudflare will ever pass this which seems
overkill.

"Use lowercase URLs" -> So, on my site you can access objects with IDs like
youtube's (/id/vZEz7JoNnfgVo...). It's picking up all those as errors. Feels
wrong?

UI: Not a fan of the "x inlinks / y outlinks / headers / recrawl / html /
copy" links below the URLs on the results page. Low contrast, unclear what I'm
clicking and where it's gonna take me. The "copy" button: What am I copying?
(Clearly the URL as I tried it, but that'd be more useful as a clipboard
button next to the URL for example)

Finally, I ran it on my company's blog and it ended up crawling a ton of the
company's various exosites on different domains which wasn't super useful,
especially since none of it showed up in the final results.

Hey, this is a really great tool. Fast, slick UI and very clear what it does.
I'll keep an eye on it and would love to see what else it can do in the
future.

Pricing: It's hard to see myself paying for this; not because it's not worth
it (I think it's easily worth a dozen USD per site checked), but because it's
so easy to look at what it does and think "Yeah, but I can probably do all
that myself, and if I don't, it's not so important that I need to pay for a
tool to tell me what to fix". I think this is the curse of developing products
targeted at developers: devs will tend to think "I can do this myself // I
don't need this". In fact, if you hook me up with a free account, I'll use it
a bunch ;)

Shoot me an email (see profile) if you want to talk through some more feedback
(especially UX feedback). You just provided me with a pretty cool service for
free so I feel I have to give back :)

~~~
seanwilson
Awesome, thanks for the detailed feedback! As you can probably imagine,
tweaking the rules to work with every imaginable website configuration is an
ongoing process so this is super helpful.

> All the errors I'm getting on my site are due to the login-wall on some of
> the pages because it's detecting /account/login/?next=... links as internal
> redirects.

> "Use unique titles" / "Set page descriptions" / "Avoid URL parameters" /
> "Avoid thin content pages" / etc

Allowing Checkbot to log in could help, but I'll look into how to improve this.

> "Avoid inline JavaScript" -> I'm making quite a bit of use of the <script
> type="application/json" id="foo"/>

Ah, thanks, this is an easy fix. I'm planning to add structured data checks in
the future as well because checking you've configured these correctly on all
your pages is cumbersome.

> "Hide server version data" -> This is picking up "server: cloudflare" as an
> error. Means no site behind cloudflare will ever pass this which seems
> overkill.

Yes, for what it's worth this is defined as "low priority" internally and the
rule description is written to emphasise this. I could change it to only fire
when there are version numbers in the headers, perhaps. I agree knowing you're
using Cloudflare isn't a big deal, but some servers will advertise very
specific OS and PHP versions, for example.
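
If anyone reading does want to trim those, it's usually a one-line config
change, e.g. for nginx or Apache (standard directives, just as an
illustration):

    # nginx: send "Server: nginx" with no version number
    server_tokens off;

    # Apache: send "Server: Apache" only and drop the error-page footer
    ServerTokens Prod
    ServerSignature Off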

> "Use lowercase URLs" -> So, on my site you can access objects with IDs like
> youtube's (/id/vZEz7JoNnfgVo...). It's picking up all those as errors. Feels
> wrong?

Yes, I'll need to think about how to avoid that case. It's a good general
rule when you're writing human-readable URLs, however, so I wouldn't want to
disable it completely.

> UI: Not a fan of the "x inlinks / y outlinks / headers / recrawl / html /
> copy" links below the URLs on the results page. Low contrast, unclear what
> I'm clicking and where it's gonna take me. The "copy" button: What am I
> copying? (Clearly the URL as I tried it, but that'd be more useful as a
> clipboard button next to the URL for example)

Hmm, any more suggestions on what to change here? I made these links prominent
because they were common user actions and added tooltips to them to help
describe what they do. I agree it's not completely obvious what they do at
first but there's only so much space available. I experimented with only
showing these when you hover over a table cell. I do want to add more
shortcuts in the future such as a quick way to look up a URL on Google or on
archive.org so I'll likely have a "more" button for extra options later.

> Finally, I ran it on my company's blog and it ended up crawling a ton of the
> company's various exosites on different domains which wasn't super useful,
> especially since none of it showed up in the final results.

Can you give more details here? Checkbot will probe <a href="..."> links to
check they're working for example but shouldn't spider sites that are
considered external. I originally had it crawling subdomains of the start URL
but changed that default because it wasn't what most people wanted.

> Shoot me an email (see profile) if you want to talk through some more
> feedback (especially UX feedback).

Great, let's keep in contact (see my profile as well)! Hopefully it's obvious
UX is important to me too. It's been challenging to find a balance in showing
the right amount of information on the screen while battling with the
horizontal space constraints you get with long URLs. The "Avoid temporary
redirects" report is a good example of this e.g. for each row, you want to
know the redirect status code, the start URL, redirect destination and
redirect path.

~~~
scrollaway
> _Allowing Checkbot to log in could help but I'll look into how to improve
> this._

Wouldn't help in my scenario FWIW, my site is oauth-only login.

> _Hmm, any more suggestions on what to change here?_

I'd move the copy link into a clipboard button next to the URL (like GitHub's
"clipboard" button next to URLs), and make the remainder of the links
prominent buttons. I would also avoid taking users to a separate page when
clicking any of them; rather, open a "sub view" below the URL (e.g. a nested
list).

If you have less-often-used actions, you could also add a "..." menu on the
right side, or next to the buttons.

> _Can you give more details here?_

Try it against articles.hsreplay.net and look at all the URLs it ends up
checking against; you'll see what I mean. It didn't end up spidering all of
hsreplay.net, but it did go through a ton of it.

~~~
seanwilson
> Wouldn't help in my scenario FWIW, my site is oauth-only login.

Would being able to set cookies or send custom headers help?

> Try it against articles.hsreplay.net and look at all the URLs it ends up
> checking against, you'll see what I mean. It didn't end up spidering all of
> hsreplay.net, but it did go through a ton of it.

Hmm, so if you check "Explore" -> "External URLs" there are a ton of external
links being checked for this URL. I'm not sure what you could do here except
exclude hsreplay.net URLs from being checked.

Thanks for the other tips and examples, I'm actively working on this.

~~~
scrollaway
> _Would being able to set cookies or send custom headers help?_

I don't think so. And if it's for SEO purposes, I don't care; these pages
won't get crawled by Google anyway. I'm OK with ignoring the URLs, but it'd be
nice if, for example, you detected a bunch of redirects to a pattern that
contains "login" and asked the user if they want to add the login URL to the
blocklist. I didn't have much success adding it myself.

~~~
seanwilson
> these pages won't get crawled by Google anyway.

Hmm, how do you indicate this to Google? I'm thinking about how you could tell
Checkbot to ignore pages like this.

> I didn't have much success adding it myself.

The "URL patterns to ignore" setting is just a JavaScript regex string if that
helps. It needs some help text at a minimum.
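
For example, a pattern like /account/login/ can be sanity-checked in a browser
console before pasting it in (rough sketch; the URLs are just made up):

    // Rough check that the regex matches what you expect:
    var ignore = new RegExp("/account/login/");
    ignore.test("https://example.com/account/login/?next=/foo");  // true
    ignore.test("https://example.com/articles/1/");               // false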

A common scenario I see as well is you start a crawl, see the URLs flying by
and think "oops, don't want to crawl those URLs". A cancel button would help
but hopefully more can be done.

~~~
scrollaway
> _Hmm, how do you indicate this to Google? I'm thinking about how you could
> tell Checkbot to ignore pages like this._

I don't. I'm not sure how Google picks up on the fact that they're login
walls. Maybe it's heuristics? Someone better at SEO than me could explain.

Re URLs to ignore: OK, I see; that wasn't clear though. My suggestion is
two-fold: 1. add a way to specify plain loose matching (e.g. just
/account/login/, skipping the URL parameters, etc.; see the sketch below),
and 2. let me add that pattern after the URLs have already been crawled (and
make sure I can see which URLs are affected).
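
Something along these lines is what I had in mind for the loose matching
(quick sketch, obviously not thought through):

    // Treat the ignore entry as a plain path prefix; the query string is irrelevant
    function shouldIgnore(url, prefix) {
      return new URL(url).pathname.startsWith(prefix);
    }

    shouldIgnore("https://example.com/account/login/?next=/foo", "/account/login/"); // true
    shouldIgnore("https://example.com/articles/1/", "/account/login/");              // false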

Keep in mind that this is pretty raw feedback and you know your product
better than I do, but I definitely don't think the URLs-to-ignore setting is
usable right now.

I'm off to bed, I hope all that helped. :)

~~~
seanwilson
> I don't. I'm not sure how Google picks up on the fact that they're login
> walls. Maybe it's heuristics? Someone better at SEO than me could explain.

At a guess, Google is following the redirect to the login page and seeing
that the login page text isn't relevant to your search results. It's not a
big deal if Google hides those pages, but in Checkbot perhaps those are pages
you want to examine, so I'll need to think about what to do here.

> Keep in mind that this is pretty raw feedback and you know your product
> better than me

This kind of feedback is amazing so keep it coming! Knowing the first thing
you thought before taking the time to fully investigate a feature is super
useful because most users would be gone already if they were confused.

> but I definitely don't think the URLs-to-ignore setting is usable right now.

Yes, fully agree with that. I think if you're dealing with regexes, you want a
way to test them as it's too easy to make a mistake.

------
ramon
Hi, I've downloaded it already but haven't tested it yet. I'll test it as soon
as I can.

