I think it has to do with shifting expectations. All of us who use the Web serio...

not2b · on April 1, 2020

No, you don't want a full text search engine. If you think you do, you don't remember the pre-Google world. It was impossible to use the older search engines to find a reasonable explanation of a common topic, because to Alta Vista and other search engines of that era, every page that contained a given term was considered equal to every other page, and they would give you all of them in a random order. You could add lots of AND and OR to try to exclude what you didn't want, and this might cut you down to 40 or 50 pages to go through to maybe find what you want.

But when Google first came out, it was a shock. You could just search for something like "Linux", and the most authoritative sites all showed up on the first page.

userbinator · on April 1, 2020

> and this might cut you down to 40 or 50 pages to go through to maybe find what you want.

At least those search engines gave you that many results to go through... now Google gives you fewer than that, full of spam (despite the index probably containing far more), and you'll be in CAPTCHA hellban if you try harder to get to the rest.

miracle2k · on April 1, 2020

A full text search with a good ranking is still a full text search. The point here is that Google used to do the job just fine, but no longer does.

kevin_thibedeau · on April 1, 2020

AltaVista's killer feature was the NEAR operator.
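
For anyone who never used it, a proximity operator of this kind can be sketched roughly as below. This is only an illustration, not AltaVista's implementation: the function name `near`, the tokenizer, and the default window of 10 words are all assumptions made for the example.

```python
import re

def near(text, term_a, term_b, max_distance=10):
    """Return True if term_a and term_b occur within max_distance words of each other.

    The window size is an illustrative assumption, not a documented engine setting.
    """
    # Crude tokenizer: lowercase runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    # A match is any pair of occurrences whose word offsets are close enough.
    return any(abs(a - b) <= max_distance
               for a in positions_a
               for b in positions_b)

page = "Linux is a Unix like operating system"
print(near(page, "linux", "system", max_distance=3))   # False: the terms are 6 words apart
print(near(page, "linux", "system", max_distance=10))  # True
```

The point of the operator is that it filters on word distance rather than mere co-occurrence, which plain AND cannot express.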

jkaptur · on April 1, 2020

Yeah, I'm sort of surprised that there isn't a semi-popular "web grep" tool for people who would rather use regex, some understandable ranking algorithm with knobs to tweak, etc.

Of course, you'd have to read a manual to use it and it would have a ton of spam, but some people just want lower-level control - they still sell stick-shift cars.
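
As a rough idea of what such a tool's core might look like, here is a minimal sketch over an in-memory set of pages. Everything in it is hypothetical: the name `web_grep`, the `(title, body)` page layout, the example URLs, and the two weights standing in for the "knobs". A real tool would also need a crawler and an index, which is the hard part.

```python
import re

def web_grep(pages, pattern, title_weight=3.0, body_weight=1.0):
    """Rank pages by weighted regex match counts.

    pages maps a URL to a (title, body) tuple. The weights are the
    user-tunable knobs: raise title_weight to favor title matches.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    results = []
    for url, (title, body) in pages.items():
        score = (title_weight * len(regex.findall(title))
                 + body_weight * len(regex.findall(body)))
        if score > 0:
            results.append((score, url))
    # Highest score first; ties are broken alphabetically by URL.
    results.sort(key=lambda item: (-item[0], item[1]))
    return results

# Hypothetical pages, purely for illustration.
pages = {
    "a.example/linux": ("Linux kernel", "The Linux kernel is a Unix-like kernel."),
    "b.example/misc": ("Misc notes", "I tried linux once."),
    "c.example/cats": ("Cats", "Nothing relevant here."),
}
results = web_grep(pages, r"linu[xs]")
print(results)  # [(4.0, 'a.example/linux'), (1.0, 'b.example/misc')]
```

The ranking here is deliberately transparent: the score is just a weighted count, so a user can see exactly why one page outranks another.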

zo1 · on April 1, 2020

Not just that, but the sheer scale of such an index. The size of the web now just makes anything small next to impossible without a lot of funding. And none of the existing search engines will probably allow you programmatic/data access to their index without a metric ton of cash.

artificial · on April 1, 2020

How is it that spiders/bots are able to "index" copyrighted content? Is it just one of those things where the ends justify the means or a holdover/tradition or some such?

kd5bjo · on April 1, 2020

It’s some combination of fair use and raw data not being copyrightable. My understanding is that it’s only the creative expression that’s copyrighted, and not the actual words. So, if you distill out all of the creativity into something that’s purely information about the work, you’re probably fine copyright-wise.

There’s a long tradition of compiling and publishing concordances, which are just indices of every place each word appears in the original text. They’re generally not useful without access to the original, so nobody seems to mind them very much. Google’s index is just a modern form of the same thing.
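
To make the concordance idea concrete, here is a minimal sketch that maps each word to the positions where it appears. The function name `build_concordance` and the simple tokenizer are assumptions for illustration, and this says nothing about how Google's index is actually built.

```python
import re
from collections import defaultdict

def build_concordance(text):
    """Map each lowercased word to the 0-based word offsets where it occurs."""
    concordance = defaultdict(list)
    # Crude tokenizer: runs of letters, digits, and apostrophes.
    tokens = re.finditer(r"[a-z0-9']+", text.lower())
    for position, match in enumerate(tokens):
        concordance[match.group()].append(position)
    return dict(concordance)

sample = "the quick brown fox jumps over the lazy dog"
concordance = build_concordance(sample)
print(concordance["the"])  # [0, 6]
print(concordance["fox"])  # [3]
```

Note that the result holds only words and offsets. Without the original text you cannot tell what the sentence said beyond which words it used and where, which is the sense in which a concordance is information about a work rather than a copy of it.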

sefrost · on April 1, 2020

I wonder how many pages would be worth indexing for such a tool though.

zo1 · on April 1, 2020

Probably a tiny subset. But the problem is finding that small subset!