
Show HN: robots.txt as a service, check web crawl rules through an API - fooock
https://robotstxt.io
======
dscpls
Why a service and not a library?

It looks like a great way for you to discover URLs but like a terribly slow
way for people to avoid implementing robots.txt rules.

~~~
fooock
The project has multiple subprojects, and one of them is the parser. Any
developer can compile or extend it without much effort and create a library.
You just need to know Kotlin / Java.

The aim of this project is only to check whether a given web resource can be
crawled by a user agent, but through an API.
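
Just to illustrate the idea, a client call could be a single HTTP request,
something like the sketch below (the endpoint and parameter names are only an
example, not the final API):

    import java.net.URI
    import java.net.URLEncoder
    import java.net.http.HttpClient
    import java.net.http.HttpRequest
    import java.net.http.HttpResponse

    fun main() {
        // Hypothetical endpoint and parameter names, for illustration only
        val url = "https://example.com/test/user/1"
        val agent = "MyUserAgentBot"
        val query = "url=" + URLEncoder.encode(url, "UTF-8") +
                "&userAgent=" + URLEncoder.encode(agent, "UTF-8")
        val request = HttpRequest.newBuilder()
            .uri(URI.create("https://robotstxt.io/api/check?$query"))
            .GET()
            .build()
        val response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString())
        // The response says whether that URL is crawlable for that user agent
        println(response.body())
    }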

------
tehwhale
While this looks good, I don't think it's feasible for a web crawler in most
cases. Crawlers want to crawl a ton of URLs and it would have to make a
request to your service for each and every URL.

What's the plan here? Check for a sitemap.xml (which generally only contains
crawlable URLs anyway) or crawl the index and look for all links and send a
request to your service for every URL before crawling it?

I personally think it would be better suited as a library where you can pass
it a robots.txt and it'll let you know if you can crawl a URL based on that.

~~~
fooock
I implemented this service with the idea of making a network request for each
new URL that needs to be crawled. Internally the service caches all requests
by base domain and user agent. Responses are very fast if the domain was
previously checked.

For example, if you want to check the URL https://example.com/test/user/1
with the user agent MyUserAgentBot, the first request can be slow (~730ms),
but subsequent requests with different paths and the same base URL, port and
protocol will use the cached version (just ~190ms). Note that this version is
in alpha and many things can be optimized. There is a trade-off to be found
between managing these files yourself in each project and the time spent on
network requests.
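
To give a rough idea of what the cache key looks like (a simplified sketch,
not the actual implementation):

    import java.net.URI

    // Simplified sketch: two URLs that share scheme, host and port map to
    // the same cached robots.txt entry for a given user agent.
    data class CacheKey(val scheme: String, val host: String, val port: Int,
                        val userAgent: String)

    fun keyFor(url: String, userAgent: String): CacheKey {
        val uri = URI(url)
        return CacheKey(uri.scheme, uri.host, uri.port, userAgent)
    }

    fun main() {
        val a = keyFor("https://example.com/test/user/1", "MyUserAgentBot")
        val b = keyFor("https://example.com/other/path", "MyUserAgentBot")
        println(a == b) // true, so the second lookup reuses the cached rules
    }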

Anyway, anyone can compile the parser module and create a library to check
robots.txt rules locally ;-)
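
For example, a very rough local check against Disallow rules could look like
this (just an illustration, not the project's parser):

    // Rough local robots.txt check: only looks at Disallow lines under a
    // matching User-agent group. Not the project's parser, just a sketch.
    fun isAllowed(robotsTxt: String, userAgent: String, path: String): Boolean {
        var inGroup = false
        for (line in robotsTxt.lines().map { it.substringBefore('#').trim() }) {
            when {
                line.startsWith("User-agent:", ignoreCase = true) -> {
                    val agent = line.substringAfter(':').trim()
                    inGroup = agent == "*" ||
                            userAgent.contains(agent, ignoreCase = true)
                }
                inGroup && line.startsWith("Disallow:", ignoreCase = true) -> {
                    val rule = line.substringAfter(':').trim()
                    if (rule.isNotEmpty() && path.startsWith(rule)) return false
                }
            }
        }
        return true
    }

    fun main() {
        val robots = """
            User-agent: *
            Disallow: /private/
        """.trimIndent()
        println(isAllowed(robots, "MyUserAgentBot", "/test/user/1"))  // true
        println(isAllowed(robots, "MyUserAgentBot", "/private/data")) // false
    }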

PS: thanks for the feedback

------
itsmefaz
The service is very nice and I understand your reason for developing it. I see
this service having more value in helping companies find all the web pages,
rather than just the allowed ones.

I understand the unethical nature of the above method, however, I see it
happening quite a lot in practice.

~~~
fooock
Yes, in practice people sometimes don't want to be polite to webmasters, and
choose not to obey robots.txt rules. Thanks for the suggestion!

~~~
itsmefaz
Exactly, your service could definitely be used as an alternative to parsing
robots.txt yourself (which is traditionally plain text) in favour of more
standard JSON parsing, along with the advantages that come with making it REST.

------
fooock
I created this project to use in my own projects. It is open source. You can
use it if you are implementing SEO tools or a web crawler. Note that this is a
first alpha release.

Give me some feedback!

