
Fast website link checker in Go - raviqqe42
https://github.com/raviqqe/muffet
======
jchw
I find the use of rake to be kind of unorthodox, and yet I don't know what
else you'd use in the Go world, other than maybe just Makefiles. Any
particular reason to choose Rake? It's probably not easy to get it running on
Windows based on my experience playing with Rails on Windows.

Other than that it looks quite useful, and it's definitely something to keep
in the tool belt. Bonus points for the subtle Undertale references too :)

~~~
mjk7841
I'm also curious about this. It seems that many Go developers out there are
using Makefiles. Makefiles are a good solution for Go projects in some
cases, but I've seen a lot of people really abusing Makefiles and trying to
use them for more generic task running.

In a past life, we used invoke [1] for task running. It was incredible but has
the same problem as rake: it introduces another language (Python) and more
dependencies.

There's a fairly new task runner being developed in go called mage [2], but it
didn't seem worth the jump yet to me as it's still pretty immature (I haven't
played with it in a few months, though). Did you consider trying that out?

[1] [https://github.com/pyinvoke/invoke](https://github.com/pyinvoke/invoke)
[2] [https://github.com/magefile/mage](https://github.com/magefile/mage)
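
For anyone who hasn't looked at mage: its core idea is that exported Go functions become build targets. The sketch below is an illustrative toy of that dispatch style (it is not mage's actual generated code, and the task names are made up):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// run shells out to a command, wiring up stdout/stderr.
func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

// tasks maps target names to functions, roughly the dispatch that mage
// generates for you from exported functions in a magefile.
var tasks = map[string]func() error{
	"build": func() error { return run("go", "build", "./...") },
	"test":  func() error { return run("go", "test", "./...") },
}

func main() {
	for _, name := range os.Args[1:] {
		task, ok := tasks[name]
		if !ok {
			fmt.Fprintln(os.Stderr, "unknown task:", name)
			os.Exit(1)
		}
		if err := task(); err != nil {
			os.Exit(1)
		}
	}
}
```

The upside over Make is that the whole thing is plain Go, so there's no second language or extra runtime to install.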

~~~
anonfunction
I like using make even for basic task running. It’s usually installed where
I’m working and is familiar. I don’t want to install rake or mage.

~~~
arbie
Make gets a bad rap, but I've seen it used for substantially complex
workflows. If you wrap your commands in something that can reattach to running
processes and use dependencies correctly, it is hard to displace.

------
omeid2
A comprehensive prompt is an extremely useful and underappreciated
productivity booster.

But please, when making screencasts or screenshots, use a simple prompt.

The information included in the prompt is often not only irrelevant but
actively distracting; it makes reading through less pleasant by increasing
the eye movement and cognitive effort needed to filter out the useless
content.

Small things make a difference.

~~~
chrismorgan
I wrote a script bash-for-recording for myself a few years back, for
invocation by termrec, which set the window size to 80×24, set HOME to a new
empty directory, set USER and NAME to dummy values (creating an actual new
throwaway user account would be better here, but I was lazy), set TERM to
xterm-256color (I think it was), cleared the environment (env -i) and possibly
one or two other things, set a deliberately very simple and obvious prompt
(which sets the title as well), cleared the screen, then finally started a
nice clean bash profile.

I should pull out my old laptop or my backups and resuscitate the script.

The example at [http://tty-player.chrismorgan.info/](http://tty-player.chrismorgan.info/)
was generated using that script.

~~~
chrismorgan
I remembered a couple of details incorrectly, but this was my bash-for-
recording script:

      #!/bin/bash
    
      # Start a new bash shell with a severely filtered environment and no initfile.
      if [ -z "$_BFR_RUNNING" ]; then
          env -i \
              _BFR_RUNNING=1 \
              PATH="$PATH" \
              LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
              TERM="$TERM" \
              SHELL="$SHELL" \
              USER="$USER" \
              HOME="$HOME/bfr-home" \
              LANG="$LANG" \
              bash --init-file "$0" "$@" <&0
          exit $?
      else
          unset _BFR_RUNNING
      fi
    
      # What remains of this file is the initfile.
    
      USER=user
      HOSTNAME=hostname
      PS1='\n\[\033[32;45;1m\]\w\[\033[m\]\$ '
      eval "$(dircolors -b)"
      alias ls='ls --color=auto'

------
gtirloni
Pretty nice tool that Just Works (tm).

The concurrency level (512 connections) is a bit too aggressive for most
servers, though. You'll get throttled or blocked, or your backend will crash
(which isn't too bad in itself, except that it's probably not what you were
after with a link checker).

~~~
raviqqe42
Actually, I totally agree with you. I chose the number based on the default
maximum number of open files on Linux, because I was not sure about common
limits on concurrent connections between clients and HTTP servers.
Alternatively, the tool should probably regulate the number of requests per
second sent to the same host. If someone suggests other options, I will adopt
them.
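
For what it's worth, the per-host throttling idea can be sketched with one semaphore per hostname. This is an illustration only, not what muffet does; the limit of 6 below just mirrors typical browser per-host connection limits:

```go
package main

import (
	"fmt"
	"sync"
)

// hostLimiter caps in-flight requests per host with one buffered-channel
// semaphore per hostname, created lazily on first use.
type hostLimiter struct {
	mu    sync.Mutex
	limit int
	sems  map[string]chan struct{}
}

func newHostLimiter(limit int) *hostLimiter {
	return &hostLimiter{limit: limit, sems: map[string]chan struct{}{}}
}

func (l *hostLimiter) sem(host string) chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	s, ok := l.sems[host]
	if !ok {
		s = make(chan struct{}, l.limit)
		l.sems[host] = s
	}
	return s
}

func (l *hostLimiter) acquire(host string) { l.sem(host) <- struct{}{} }
func (l *hostLimiter) release(host string) { <-l.sem(host) }

func main() {
	l := newHostLimiter(6)
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.acquire("example.com")
			defer l.release("example.com")
			// fetch the URL here; at most 6 goroutines per host get this far
		}()
	}
	wg.Wait()
	fmt.Println("done")
}
```

A token-bucket rate limiter per host would address the requests-per-second angle the same way; the semaphore only bounds concurrency.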

~~~
jlgaddis
Both Chrome and Firefox limited the number of connections to a server to six
(6), if memory serves. I'm not certain whether those limits have changed or
whether the number is different for HTTP/1.1 versus HTTP/2.

~~~
brobinson
The limit is per _hostname_ not per _server_ (unless things have changed in
the last 10 years).

This is why you'll see assets1.foo.com, assets2.foo.com, etc. all pointing to
the same IP address(es). Server-side code picks one based on a modulus of a
hash of the filename or something similar when rendering the HTML to get
additional pipelines in the user's browser. Not sure how or if this is done in
SPAs.
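
A sketch of that server-side trick in Go (the hostnames and the choice of hash are made up for illustration):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// assetHost deterministically shards an asset path across n hostnames.
// The same file always renders with the same host (so browser caches stay
// warm), while different files spread across the shards.
func assetHost(path string, n uint32) string {
	h := fnv.New32a()
	h.Write([]byte(path))
	return fmt.Sprintf("assets%d.foo.com", h.Sum32()%n+1)
}

func main() {
	for _, p := range []string{"/css/site.css", "/js/app.js", "/img/logo.png"} {
		fmt.Printf("%s -> %s\n", p, assetHost(p, 4))
	}
}
```

With HTTP/2 multiplexing many requests over one connection, this kind of domain sharding is largely obsolete.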

------
zdw
How fast is this? Is there even a common benchmark for this sort of thing?

How does it compare to the older python "linkchecker", which was resurrected
here (and is quite fast):
[https://github.com/linkchecker/linkchecker](https://github.com/linkchecker/linkchecker)

~~~
chrismorgan
I have found linkchecker to be unreasonably slow by default, very fragile if
you try to make it any faster (e.g. occasional socket errors with almost _any_
concurrency, regardless of the purported ulimits, in a way that suggested some
kind of socket leaking to me at the time), and possessing fairly bad
reporting. Also, on Windows, being built with ancient Python meant it didn’t
support SNI, so I had to delve into it to figure out a way of turning TLS
verification off to make it work pretty much _at all_, which also hints at
its generally poor configurability and surprisingly poor documentation (given
the fact that there _is_ actually a moderate amount of it).

I still _use_ linkchecker (because I’ve never completed my Rust-based link
checker that I started several years ago), and have extended it at work to
support client-side certificates which we use on CI, but I’m generally fairly
unimpressed with linkchecker.

------
fwip
Looks rad. Is there any plan to add authorization headers, so that I can test
a site as a particular user?

~~~
raviqqe42
Thank you for your feedback! Can you open an issue on the GitHub repository? I
would appreciate it if you added some concrete use cases, as then it'll be
clear what kinds of options need to be implemented.

------
bryanrasmussen
I don't know Go, but looking at the code, it doesn't look like it handles sites
that are rendered on the client; if so, it has limited utility on today's web.

------
jpsim
Anyone know of an equivalent for Ruby? I’d love to add this as an integration
test for jazzy[0].

Separately, it’d be awesome if this also checked that anchor links resolve to
id values to validate linking to specific elements on a page.

[0]: [https://github.com/realm/jazzy](https://github.com/realm/jazzy)
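
Not aware of a Ruby equivalent off-hand, but the fragment check is pleasantly simple to sketch (in Go, since that's the language at hand; the regexp-based extraction is a toy stand-in for a real HTML parser):

```go
package main

import (
	"fmt"
	"regexp"
)

// checkAnchors reports fragment links whose target id doesn't exist in the
// page. Regexps over raw HTML are fragile; a real checker would walk a
// parsed DOM instead.
func checkAnchors(html string) []string {
	idRe := regexp.MustCompile(`id="([^"]+)"`)
	hrefRe := regexp.MustCompile(`href="#([^"]+)"`)

	ids := map[string]bool{}
	for _, m := range idRe.FindAllStringSubmatch(html, -1) {
		ids[m[1]] = true
	}

	var broken []string
	for _, m := range hrefRe.FindAllStringSubmatch(html, -1) {
		if !ids[m[1]] {
			broken = append(broken, "#"+m[1])
		}
	}
	return broken
}

func main() {
	page := `<h2 id="install">Install</h2><a href="#install">ok</a><a href="#usage">broken</a>`
	fmt.Println(checkAnchors(page)) // [#usage]
}
```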

~~~
jamietanna
[https://github.com/gjtorikian/html-proofer](https://github.com/gjtorikian/html-proofer)
is what I use

------
jwilk
Are there any good checkers for URLs in text files?

I wrote [https://github.com/jwilk/urlycue](https://github.com/jwilk/urlycue),
but I'm not quite happy with it, and I don't have the energy to improve it
either; so I'm looking for alternatives.

~~~
oneeyedpigeon
I guess one of the bigger challenges when it comes to unstructured data is
identifying URLs. Is there a canonical way of identifying a URL embedded in
text? Is it an impossible problem?

~~~
ethhics
Perhaps I’m missing the question, but there _is_ a regular expression for
matching a URI. Remove the leading caret and it can match anywhere in a text.

[https://tools.ietf.org/html/rfc3986#appendix-B](https://tools.ietf.org/html/rfc3986#appendix-B)

Edit: I see it's not quite that simple. However, I still think that with some
stricter matching requirements this could work.

~~~
jwilk
This regexp lets you parse a valid URI, but it also matches a lot of things
that are not URIs at all.

The URI language is of course regular, so it would be possible to construct a
regexp that matches only URIs. But naively applying such a regexp wouldn't
work in practice, because many punctuation characters are allowed in URIs. For
example, single quotes are allowed, so in this Python code the regexp would
match too much:

       homepage = 'http://example.com/'

~~~
ethhics
I see: the standard allows most characters we normally use to surround URIs.
It sure does look like a difficult problem then, and one that a regexp can't
solve.

------
andyonthewings
I have been using a Node.js equivalent named broken-link-checker:
[https://github.com/stevenvachon/broken-link-checker](https://github.com/stevenvachon/broken-link-checker)

I run it in a CI job which builds a static site. It works pretty well for me.

------
gruzh
pylinkvalidator seems to be a pretty good Python alternative to this project.
It's also very fast, scanning 700+ URLs in under a minute:
[https://github.com/bartdag/pylinkvalidator](https://github.com/bartdag/pylinkvalidator)

------
thamizhan2611
Awesome tool. I have one small request: is it possible to add an architecture
diagram connecting the different moving parts, like the parser, fetcher,
daemon, etc.? IMHO this might be useful for people trying to go through the
source code to understand how the tool functions. Thanks anyway :)

------
peterwwillis
So, I'll bite. Why use this and not wget, curl, or any other HTTP spider?

~~~
pknopf
Is there a good HTML parser that you can pipe into/out of? Using Go, it would
be easier to parse the HTML and tell the difference between
<p>http://somewhere.com/</p> and <a href="http://somewhere.com/">test</a>
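
Go's standard library doesn't ship an HTML parser (golang.org/x/net/html is the usual choice), but the distinction is easy to sketch with encoding/xml on a well-formed snippet:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// extractHrefs returns the href values of <a> elements. encoding/xml only
// copes with well-formed markup; real-world HTML needs a forgiving parser
// like golang.org/x/net/html, but the walk looks the same.
func extractHrefs(doc string) []string {
	var hrefs []string
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err != nil {
			break
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "a" {
			for _, attr := range se.Attr {
				if attr.Name.Local == "href" {
					hrefs = append(hrefs, attr.Value)
				}
			}
		}
	}
	return hrefs
}

func main() {
	// The bare URL inside <p> is ignored; only the real hyperlink is found.
	doc := `<body><p>http://somewhere.com/</p><a href="http://somewhere.com/">test</a></body>`
	fmt.Println(extractHrefs(doc)) // [http://somewhere.com/]
}
```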

~~~
laumars
Personally I don't care about the difference between your two examples,
because a published URL is still a published URL regardless of whether it is a
hyperlink or not.

However, where I do care about the difference between an anchor and a
paragraph block is with relative links or other URLs without the protocol
prefix, as it is harder to programmatically guess what is a web URL and what
is, say, an example file system path in a technical document. In an ideal
world, the JavaScript, CSS, and any web APIs (e.g. JSON responses) would be
executed locally to check any modern way of abstracting away URLs (page
redirection et al). But that's not to say there isn't a place for a less
sophisticated parser (though I would say that, as I've also written a link
checker similar to the one posted hehehe).

