Katana: A crawling and spidering framework (github.com/projectdiscovery)
99 points by feross on Nov 10, 2022 | 25 comments



What is "next gen" in this implementation? Chrome support?

IMO the hardest things in distributed crawling at scale are a good URL frontier, priorities, rate limiting and things like that, which are quite often overlooked.
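
(To make the frontier point concrete, here is a toy, single-process sketch of what a frontier has to juggle: priority ordering, dedup, and a per-host politeness delay. Everything in it is illustrative and has nothing to do with Katana's internals.)

  # Toy URL frontier: priority ordering plus a per-host politeness delay.
  # All names and the 2-second delay are arbitrary, illustrative choices.
  import heapq
  import time
  from urllib.parse import urlparse

  class Frontier:
      def __init__(self, per_host_delay=2.0):
          self.heap = []      # (priority, seq, url); lower priority pops first
          self.seen = set()   # naive URL dedup
          self.next_ok = {}   # host -> earliest time we may hit it again
          self.delay = per_host_delay
          self.seq = 0

      def add(self, url, priority=0):
          if url not in self.seen:
              self.seen.add(url)
              heapq.heappush(self.heap, (priority, self.seq, url))
              self.seq += 1

      def pop(self):
          """Return the best URL whose host isn't rate limited, else None."""
          deferred, url = [], None
          while self.heap:
              prio, seq, candidate = heapq.heappop(self.heap)
              host = urlparse(candidate).netloc
              if time.time() >= self.next_ok.get(host, 0.0):
                  self.next_ok[host] = time.time() + self.delay
                  url = candidate
                  break
              deferred.append((prio, seq, candidate))
          for item in deferred:  # re-queue URLs whose hosts are still cooling down
              heapq.heappush(self.heap, item)
          return url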


It would be nice if HN could remove clickbait terms like "modern", "next-generation", "blazingly fast", etc. These characterisations only look dated, if not silly, when we look back at them years down the road.


In jest: I'd give an allowance to any product that steps right on the boundary of what we currently know as the fundamental limits of physics. Like Shannon entropy for a compression implementation. Or Planck length for processors.


Heheh.

(The Planck length is about 10^-35 m. Even the strong nuclear force operates on a scale roughly 20 orders of magnitude larger (10^-15 m). And a gargantuan electron? Forget about it.)


I wish all technology naming would follow this rule.

Fast Ethernet is my favorite example.


Reminds me of "The New Cook Book"[0] in the kitchen of a family member. That book is older than I am.

[0] (translated title)


Hi

Could you contact me? We may have some interests in common. Check my profile.


What are these interests?


"built with Rust" while you're at it :)


I too had the inclination to include that one, but I hesitated because, in theory, IMHO, it would be useful to know the language used for each software announcement on HN. In practice, putting the language into a title is pure hype. Almost invariably, I still have to manually check the source listings to confirm what language is being used. Lord only knows how much time I waste checking, only to find the project is written in some language I do not use. Today HN titles are likely to contain "written in Rust", but not long ago they frequently contained "written in Go".


We took 'next gen' out of the title since it's borderline clickbaity and tends to be a distraction.


I wrote a crawler a few years ago. I fired it up recently but had little luck fetching pages. It looked like Cloudflare was protecting the site from me.

Are any of you other HNers finding the web increasingly difficult to scrape from?


If you're scraping with Python, try cloudscraper. Among other things(!), it supports JS rendering (basically the bare-minimum check Cloudflare does) without needing to run a full browser in the background. It's built on requests, so integration (for me, anyway) was pretty easy.

https://github.com/venomous/cloudscraper
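
For reference, a minimal sketch of the drop-in usage (the target URL is a placeholder; the scraper object behaves like a regular requests session):

  # Minimal sketch: create_scraper() returns a requests-compatible session,
  # so existing requests-based code mostly works unchanged.
  import cloudscraper

  scraper = cloudscraper.create_scraper()
  resp = scraper.get("https://example.com/")
  print(resp.status_code, len(resp.text))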


Is it a 301 that keeps looping? I noticed my company's website does that when I try to cURL, and I wondered if it was cookie based, or how to get around it.

(EDIT: Yes, I figured it out. To get around the 301 loop, cURL needs to save cookies.)
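
If you're doing the same thing from Python rather than cURL, a session object gives you the equivalent of cURL's cookie jar: cookies set along the redirect chain get replayed, so the loop terminates. The URL below is a placeholder.

  # Sketch of the cookie-jar idea with requests: a Session persists cookies
  # across the redirect chain, which is what breaks the 301 loop.
  import requests

  with requests.Session() as s:
      resp = s.get("https://example.com/", allow_redirects=True)
      print(resp.status_code, resp.url)
      print(s.cookies.get_dict())  # cookies collected along the way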


For "a next-generation crawling and spidering framework", it's a little surprising to see no support for the WARC[1] format.

[1]: https://en.wikipedia.org/wiki/Web_ARChive


I wonder why the Internet Archive never tried to build a web search engine; their crawls of the entire web could be more comprehensive than Google's (assuming Google doesn't archive old copies of websites).


Brewster Kahle, the IA's founder, did. It is called Alexa Internet, and was sold to Amazon:

<https://en.wikipedia.org/wiki/Alexa_Internet>

<https://help.archive.org/help/wayback-machine-general-inform...>

A condition of that sale was that Alexa would continue to provide the results of its crawls, after a delay, to the Internet Archive. Those crawls form a substantial portion of IA's Wayback Machine archive.

I'm not certain that those archives are ongoing, as Alexa seems to have been largely shut down.

IA are a bit cagey on details, but I believe that there is a general IA-based archival service. There's certainly the "Save Page Now" feature:

  https://web.archive.org/save/<URL>
And the independent but closely cooperating ArchiveTeam (led by Jason Scott) tailors crawlers, via its Warrior software, to specific endangered / vulnerable online websites:

<https://wiki.archiveteam.org/>


Interesting. From a consumer's perspective I never liked Alexa, but from a hoster's perspective it was awesome, especially when you were in the top 1000. It helped my site get more popular.


That’s both really intriguing and horrifying!

It’s already _technically_ impossible to erase something from the internet, but if they removed the barrier of having to know where something used to live in order to find it in the archive, erasure would become truly impossible in every sense of the word.


Crawling should be the easiest part.


I don't know if there is an easy part in search. Almost every aspect of it has unique challenges.

Large-scale crawling is primarily a challenge of balancing the logistics in a way that is kind to both the crawler and the data consumers.

Distributed crawling, if you go that way, is also non-trivial, as you're effectively juggling a shared, rapidly mutating state of dozens of gigabytes.


At a guess, WARC wants headers and other details that are at the very least inconvenient to get at with your usual headless browser drivers. I also have a hunch WARC may not be entirely well defined when archiving JS-rendered websites.
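
To illustrate why: a WARC "response" record wants the raw status line, headers, and payload. With a plain HTTP client that's easy (the sketch below uses the warcio library, which is my choice of tool, not something the thread mentions), but a headless-browser driver usually doesn't hand you those pieces directly.

  # Sketch with warcio: writing a response record needs the raw HTTP
  # status line, headers and body. URL and filenames are placeholders.
  import requests
  from warcio.warcwriter import WARCWriter
  from warcio.statusandheaders import StatusAndHeaders

  url = "https://example.com/"
  resp = requests.get(url, headers={"Accept-Encoding": "identity"}, stream=True)

  with open("example.warc.gz", "wb") as out:
      writer = WARCWriter(out, gzip=True)
      http_headers = StatusAndHeaders("200 OK", resp.raw.headers.items(),
                                      protocol="HTTP/1.1")
      record = writer.create_warc_record(url, "response",
                                         payload=resp.raw,
                                         http_headers=http_headers)
      writer.write_record(record)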


Was there any specific library this was inspired by, or a specific use case it was built for besides the obvious generic case?


How does this handle headless? Does it just come with a baked-in Chrome binary?




