I created a metasearch for myself based on the idea of "continuation searches". One obvious point of collaboration between search engines would be a uniform API and SERP format. Currently there are slight variations between search engines in the submission URL syntax and parameters and in the HTML used to display results, not to mention the HTTP method, limits on the number of results, and sometimes additional optional URL parameters. The differences are generally small^1 and this makes it relatively easy to create a personal metasearch, but it would be much less cumbersome if these differences were eliminated.
1. Exceptions are, e.g., engines that require two HTTP requests per query, such as Gigablast, or ones with strange limitations, such as Startpage, which has become unusable for me without JavaScript. Contacting their "customer support" yielded no response.
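For concreteness, the per-engine differences mostly reduce to a small table of request details plus a result selector. The engine names, URLs, parameter names and selectors below are invented placeholders rather than any real engine's API; it's only a sketch of the shape of the adapter layer a personal metasearch ends up needing.

    from urllib.parse import urlencode

    ENGINES = {
        # Each engine differs in base URL, HTTP method, query/paging parameter
        # names, and the markup its SERP uses for a result link.
        # All values here are hypothetical.
        "engine_a": {
            "url": "https://search-a.example/search",
            "method": "GET",
            "query_param": "q",
            "page_param": "start",
            "result_selector": "a.result__link",
        },
        "engine_b": {
            "url": "https://search-b.example/do/search",
            "method": "POST",
            "query_param": "query",
            "page_param": "page",
            "result_selector": "div.result h3 a",
        },
    }

    def build_request(engine, query, page=0):
        """Return (url, method, body) for one engine's particular syntax."""
        cfg = ENGINES[engine]
        params = {cfg["query_param"]: query, cfg["page_param"]: page}
        if cfg["method"] == "GET":
            return cfg["url"] + "?" + urlencode(params), "GET", None
        return cfg["url"], "POST", urlencode(params).encode()

With a uniform API and SERP format, that table, and the per-engine result parsing it implies, would collapse to a single entry.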
Even better would be if search engines all shared their indexes and made them available for download. This would facilitate building new search engines without needing to maintain an index of one's own. In theory it would also put a stop to the problem of people who submit large numbers of queries, since all the bulk data they need would be available for download. WWW indexes that comprise public information could be freely shared as public data.
An index is, lowballing it, hundreds of gigabytes of dense binary soup, probably in some custom format specific to that search engine (sometimes there's some form of hash table going on, sometimes a B-tree), and almost certainly with its own idiosyncrasies concerning keyword extraction. I think reconciling API differences is probably a lot easier than making use of index data.
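To make "index data" concrete: conceptually it's an inverted index mapping terms to postings lists, but everything below that level, the on-disk layout, the compression, and the tokenisation that decides what a "term" even is, is engine-specific. The toy version below is only the conceptual shape, not a format anyone could exchange as-is.

    from collections import defaultdict

    def tokenize(text):
        # Keyword extraction is itself engine-specific (stemming, stop words,
        # language handling); lowercased whitespace splitting stands in for all of it.
        return text.lower().split()

    def build_index(docs):
        """Map each term to a postings list of (doc_id, term_frequency) pairs."""
        index = defaultdict(lambda: defaultdict(int))
        for doc_id, text in docs.items():
            for term in tokenize(text):
                index[term][doc_id] += 1
        return {term: sorted(freqs.items()) for term, freqs in index.items()}

    docs = {"d1": "shared web index data", "d2": "search engine index format"}
    print(build_index(docs)["index"])   # [('d1', 1), ('d2', 1)]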
I still quite like the idea of having a number of independent search engines each indexing their own specialist subjects, and one or more federated search front-ends which can pull these together.
Doing it with APIs is a little tricky to make work in a usable way, though. There have been various attempts at standardised APIs, e.g. OpenSearch[0], and metasearch engines like searX[1] have what are essentially pluggable scrapers, but there are still fundamental issues, like getting different results at different times and reconciling different ranking mechanisms.
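One standard way a federated front-end can paper over incompatible ranking mechanisms is reciprocal rank fusion, which only needs each engine's result order, not its scores. This is a generic technique sketch with made-up URLs, not a claim about how searX or any particular metasearch actually merges results.

    def reciprocal_rank_fusion(result_lists, k=60):
        """Score each URL by summing 1 / (k + rank) over the lists it appears in."""
        scores = {}
        for results in result_lists:
            for rank, url in enumerate(results, start=1):
                scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    engine_a = ["https://a.example/1", "https://b.example/2", "https://c.example/3"]
    engine_b = ["https://b.example/2", "https://d.example/4", "https://a.example/1"]
    print(reciprocal_rank_fusion([engine_a, engine_b]))
    # URLs returned by both engines float to the top

It doesn't solve the problem of each engine returning different results at different times, of course; it only reconciles the orderings for a single query.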
Integrating at the index level could make for a more usable search, but there are lots of other issues with that approach, e.g. those experienced with Apache Solr's Cross Data Centre Replication[2]. And yes, the volume of data may also be an issue, given that a search index will typically be slightly larger than the compressed data size, e.g. the 16M Wikipedia docs are approx 32 GB compressed and approx 40.75 GB in a search index.
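Back-of-envelope, the Wikipedia figures quoted above put the index at roughly 1.3x the compressed corpus. The extrapolation below to a larger corpus is purely illustrative; the real ratio depends on the engine and on what it stores alongside the postings.

    compressed_gb = 32.0                  # ~16M Wikipedia docs, compressed
    index_gb = 40.75                      # the same docs in a search index
    ratio = index_gb / compressed_gb
    print(f"index/compressed ratio: {ratio:.2f}")                # ~1.27

    corpus_gb = 500.0                     # hypothetical larger compressed corpus
    print(f"estimated index size: {corpus_gb * ratio:.0f} GB")   # ~637 GB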