Hacker News | fooock's comments

Seems the site is broken.


I believe it's restricting based on the HTTP Referer header. If I click the link, I get a 429. If I refresh, I get a 429. If I paste the URL into the address bar and hit enter, it works fine.



I see I cannot rely on Streamlit Cloud :( After refreshing twice, it works for me now.


How is this different from using Earthly and a private Nexus instance?


Not sure about Earthly as I’ve never used it, but compared to Nexus and other artifact-repository solutions for package repos:

1. No setup needed. No need to set up an apt repo and push packages or configure a mirror, because StableBuild already caches the complete registry and thus has everything.

2. No need to think about the complete package list when pushing files to the artifact repo. Have packages cached from 3 months ago and now want to add another one? Oops, it's not in Nexus, and the current versions in the Ubuntu package registry are not compatible with your cached versions -> you need to update the full dependency tree.

3. Integration is trivial. Three lines in your Dockerfile and done.

4. You can retroactively fix things. Know that this container built 4 weeks ago? OK, use that as a pin date -> fixed.
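To illustrate the pin-date idea in point 4: Debian's public snapshot archive (snapshot.debian.org) enables the same trick of freezing apt at a date. A minimal sketch, using that public archive as a stand-in since StableBuild's actual mirror URLs aren't shown in this thread (the date below is an arbitrary example):

```dockerfile
FROM debian:bullseye
# Point apt at a fixed snapshot date so the package set never drifts,
# no matter when the image is rebuilt.
RUN echo "deb [check-valid-until=no] https://snapshot.debian.org/archive/debian/20230101T000000Z/ bullseye main" \
      > /etc/apt/sources.list \
 && apt-get update \
 && apt-get install -y --no-install-recommends curl
```

Rebuilding this image months later installs the exact same package versions, because the snapshot URL never changes its contents.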

Then there’s some stuff, like the immutable Docker pull-through cache and the historical PyPI mirror, that I haven’t seen before (but I’d like to learn if others are doing this :-)).


Yes, in practice people sometimes don't want to be polite to webmasters, and choose not to obey robots.txt rules. Thanks for the suggestion!


Exactly, your service could definitely be used as an alternative to parsing robots.txt (which is plain text in its own ad-hoc format) in favor of more standard JSON parsing, along with the advantages that come with making it REST.


I implemented this service with the idea of making a network request for each new URL that needs to be crawled. Internally, the service caches all requests by base domain and user agent, so responses are very fast if the domain was previously checked.

For example, if you want to check the URL https://example.com/test/user/1 with the user agent MyUserAgentBot, the first request can be slow (~730ms), but subsequent requests with different paths and the same base URL, port, and protocol will use the cached version (just ~190ms). Note that this version is an alpha and many things can be optimized; there is a balance to strike between managing these files across different projects and the time between network requests.
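The caching scheme described above (one entry per base domain and user agent, shared across paths) can be sketched roughly like this. The class and method names are hypothetical; the service's real internals aren't shown in the thread:

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: cache robots.txt fetch results by (scheme, host, port, user-agent),
// so only the first request per base domain pays the network cost.
class RobotsCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // The cache key ignores the path: /test/user/1 and /test/user/2
    // on the same host share one entry.
    static String cacheKey(URI url, String userAgent) {
        return url.getScheme() + "://" + url.getHost() + ":" + url.getPort()
                + "#" + userAgent;
    }

    String robotsFor(URI url, String userAgent) {
        return cache.computeIfAbsent(cacheKey(url, userAgent),
                key -> fetchRobots(url)); // network hit only on a cache miss
    }

    // Stand-in for the real HTTP fetch of https://host/robots.txt
    private String fetchRobots(URI url) {
        return "User-agent: *\nDisallow: /private/";
    }

    public static void main(String[] args) {
        RobotsCache cache = new RobotsCache();
        URI first = URI.create("https://example.com/test/user/1");
        URI second = URI.create("https://example.com/test/user/2");
        cache.robotsFor(first, "MyUserAgentBot");  // slow: fetches robots.txt
        cache.robotsFor(second, "MyUserAgentBot"); // fast: same base URL, cached
        System.out.println(RobotsCache.cacheKey(first, "MyUserAgentBot")
                .equals(RobotsCache.cacheKey(second, "MyUserAgentBot"))); // prints true
    }
}
```

Keying on scheme, host, and port (rather than the full URL) is what makes the second request cheap while still respecting that robots.txt is served per origin.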

Anyway, anyone can compile the parser module and create a library to check robots.txt rules themselves ;-)

PS: thanks for the feedback


The project has multiple subprojects, and one of these is the parser. Any developer can compile or extend it without much effort and create a library; you just need to know Kotlin / Java.

The aim of this project is only to check whether a given web resource can be crawled by a user agent, but via an API.


I created this project to use in my own projects. It is open source. You can use it if you are implementing SEO tools or a web crawler. Note that this is a first alpha release.

Give me some feedback!

