Yeah, I'm sort of surprised that there isn't a semi-popular "web grep" tool for people who would rather use regex, some understandable ranking algorithm with knobs to tweak, etc.
Of course, you'd have to read a manual to use it and it would have a ton of spam, but some people just want lower-level control - they still sell stick-shift cars.
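For what it's worth, here is a toy sketch of what "regex plus an understandable ranking with knobs" could look like. Everything in it is made up for illustration: the `PAGES` corpus, the `web_grep` function, and the `match_weight`, `hit_cap`, and `length_penalty` knobs are all hypothetical, and a real tool would need an actual crawl and index behind it.

```python
import re

# Hypothetical in-memory "crawl"; a real tool would query an index instead.
PAGES = {
    "example.com/a": "grep is a tool for searching text with regex",
    "example.com/b": "regex regex regex buy cheap regex now click here "
                     "for more regex deals today",
    "example.com/c": "cars with manual transmission",
}

def web_grep(pattern, pages, match_weight=1.0, hit_cap=None, length_penalty=0.0):
    """Return URLs whose text matches `pattern`, best score first.

    score = match_weight * min(hits, hit_cap) - length_penalty * word_count
    All three knobs are illustrative assumptions, not a known ranking formula.
    """
    rx = re.compile(pattern)
    scored = []
    for url, text in pages.items():
        hits = sum(1 for _ in rx.finditer(text))
        if hits == 0:
            continue  # like grep: pages with no match are not reported
        if hit_cap is not None:
            hits = min(hits, hit_cap)  # dampens keyword stuffing
        score = match_weight * hits - length_penalty * len(text.split())
        scored.append((score, url))
    # Highest score first; ties are broken alphabetically by URL.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored]

default_order = web_grep(r"regex", PAGES)
tuned_order = web_grep(r"regex", PAGES, hit_cap=1, length_penalty=0.1)
```

With the defaults, the keyword-stuffed page wins on raw match count; capping hits and penalizing length flips the order. That's the kind of knob-twiddling I mean, and also why you'd need to read the manual.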
It's not just that; there's also the sheer scale of such an index. The size of the web now makes anything small-scale next to impossible without a lot of funding. And none of the existing search engines is likely to give you programmatic/data access to their index without a metric ton of cash.
How is it that spiders/bots are able to "index" copyrighted content? Is it just one of those things where the ends justify the means or a holdover/tradition or some such?
It’s some combination of fair use and raw data not being copyrightable. My understanding is that only the creative expression is copyrighted, and not the actual words. So, if you distill out all of the creativity into something that’s purely information about the work, you’re probably fine copyright-wise.
There’s a long tradition of compiling and publishing concordances, which are just indices of every place each word appears in the original text. They’re generally not useful without access to the original, so nobody seems to mind them very much. Google’s index is just a modern form of the same thing.
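To make the concordance idea concrete, here is a minimal sketch. The `build_concordance` function and the sample sentence are my own illustration, and I'm not claiming this is how any real search engine stores its index. It maps each word to the positions where it occurs, which is essentially a tiny inverted index.

```python
import re
from collections import defaultdict

def build_concordance(text):
    """Map each lowercased word to the list of word positions where it occurs."""
    concordance = defaultdict(list)
    # Words are runs of letters/apostrophes; position counts words, not characters.
    for position, match in enumerate(re.finditer(r"[a-z']+", text.lower())):
        concordance[match.group()].append(position)
    return dict(concordance)

sample = "the quick fox jumps over the lazy dog and the fox sleeps"
index = build_concordance(sample)
```

Note that the index only tells you where each word sits. Rebuilding the sentence from it is possible in principle but tedious, which is roughly why it isn't much use without the original.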