
Common Search – nonprofit search engine for the Web - hjacobs
https://about.commonsearch.org/
======
onion2k
One thing I hope this project does that Google fails to do is give developers
a good API to access search. Google closed down their first web search API and
now only gives developers access to a limited Custom Search API that's rate-
limited to 100 queries a day for free, with a hard limit of 10k searches. That
makes it either very hard to develop anything against or relatively expensive.
There are other options (Bing, Faroo, raw access to CommonCrawl) but they're
either low quality or hard to work with. A good quality, straightforward, open
web search API would be awesome.

~~~
mtrn
I would pay for an API that gives me access to even a partial set of quality
crawl content.

It would be even better if the web were treated, at least in part, as a
digital library, and non-profit organizations recognized the value of access
to such a resource and provided it (just as they maintain roads or public
schools).

~~~
sylvinus
Thanks for the feedback! What would be your use case for that API?

~~~
mtrn
I would love to have a list of URLs from all .edu domains that contain the
word "publications". It's a bit silly, but I'd love to build an open version
of something like Google Scholar.

I can think of many other use cases as well, where a product needs to be
built from a large but carefully selected set of raw pages.
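A minimal sketch of that kind of filter, assuming the crawl has already been extracted into (url, page_text) records; the records, helper name, and URLs below are all hypothetical, not any real Common Search API:

```python
from urllib.parse import urlparse

def edu_pages_with_keyword(records, keyword="publications"):
    """Keep URLs whose host is under .edu and whose text mentions the keyword.

    `records` is an iterable of (url, page_text) pairs, e.g. drawn from a
    crawl dump; this is purely illustrative.
    """
    hits = []
    for url, text in records:
        host = urlparse(url).netloc
        if host.endswith(".edu") and keyword in text.lower():
            hits.append(url)
    return hits

sample = [
    ("https://cs.example.edu/~smith/pubs.html", "Selected publications of the group"),
    ("https://www.example.com/about", "Our publications archive"),
    ("https://physics.example.edu/people", "Faculty directory"),
]

print(edu_pages_with_keyword(sample))
# → ['https://cs.example.edu/~smith/pubs.html']
```

At crawl scale the same predicate would run as a batch job over the raw data rather than in a loop like this, but the selection logic is the same.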

BTW: Thanks for Hackday Paris 2011! Loved that event and venue :)

~~~
struct
This is basically my use-case too, in that it's more important to get access
to _every_ document which contains that keyword and less important to rank
them in a search-engine order. I think that it may be possible to do fairly
inexpensively[1] but I'm still benchmarking to pick the right mix of
technologies and data structures.

[1] [https://www.getguesstimate.com/models/4225](https://www.getguesstimate.com/models/4225)

~~~
deusu
(I'm not with CommonSearch. I have my own project that crawls extensively
though.)

You do realize that you are talking about potentially a _LOT_ of data?

To give you an example: The word "work" occurs on about 4% of all web-pages.
So even if there were _only_ about 2bn pages in an index, that would mean 80
million matching pages. Even if you only need their URLs that would be about
2.4gb of data assuming an average URL length of 30 bytes. Ok, compression can
make that smaller, but still...

It would also mean that the server would need to make 80 million random reads
to get the URLs. Even with SSDs that would take some time. Hmm, actually in
this case it may be faster to just read all URL-data sequentially, than doing
random reads. But in both cases we would be talking about _minutes_ needed to
get all that data from disk.
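As a quick sanity check of the arithmetic above (page counts and match rate are from this comment; the sequential-read throughput is an assumed ballpark figure):

```python
pages_in_index = 2_000_000_000   # "only" 2bn pages in the index
match_rate = 0.04                # "work" occurs on ~4% of pages
avg_url_len = 30                 # assumed average URL length in bytes

matching_pages = int(pages_in_index * match_rate)
url_bytes = matching_pages * avg_url_len

# Sequentially scanning *all* URL data instead of doing random reads:
all_url_bytes = pages_in_index * avg_url_len
seq_read_seconds = all_url_bytes / 500e6   # assuming ~500 MB/s sequential SSD reads

print(f"{matching_pages:,} matching pages")           # 80,000,000
print(f"{url_bytes / 1e9:.1f} GB of URLs")            # 2.4 GB
print(f"~{seq_read_seconds / 60:.0f} min full scan")  # minutes, as stated
```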

I currently have a search-index with about 1.2bn pages - I expect to reach 2bn
pages by mid-May - that could be used to get the kind of data you need. But
not in a realtime API. Not _that_ amount of result-data.

~~~
mtrn
Interesting. To be honest, a static data set would be perfectly fine for a
first batch processing attempt.

> that could be used to get the kind of data you need.

Cool. Would you be interested in sharing or exchanging data?

~~~
deusu
I'm always open to new business opportunities. :)

What would be more useful to you, the raw data - meaning for each page a list
of the keywords on it - or the reverse-word-index?

Raw-data may be better for batch-processing or running multiple queries at the
same time.

My crawler currently outputs about 40-45gb of raw-data per day (about 30
million pages). Full crawl will be 2bn pages, updated every 2-3 months.

The reverse-word-index would be about 18gb per day for the same number of
pages.

Reverse-word-index is already compressed, raw-data isn't.

There is a small problem with the crawl though, as it does not always handle
non-ascii characters on pages correctly. I'm working on that.

BTW: I also currently have a list of about 8.5bn URLs from the crawl. About
600gb uncompressed. These are the links on the crawled pages. Obviously not
all of those will end up being crawled.
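Working backwards from the figures in this comment gives rough per-item sizes, plus the implied raw-data size of a full crawl (a back-of-envelope sketch using only the numbers stated above):

```python
# Daily crawler output, per the comment
raw_bytes_per_day = 42.5e9     # midpoint of "40-45gb"
index_bytes_per_day = 18e9
pages_per_day = 30e6

raw_per_page = raw_bytes_per_day / pages_per_day      # ~1.4 KB/page, uncompressed
index_per_page = index_bytes_per_day / pages_per_day  # ~0.6 KB/page, compressed

# URL list: 8.5bn URLs, ~600gb uncompressed
bytes_per_url = 600e9 / 8.5e9                         # ~70 bytes/URL

# Implied raw-data size of a full 2bn-page crawl
full_crawl_raw = 2e9 * raw_per_page                   # ~2.8 TB

print(round(raw_per_page), round(index_per_page), round(bytes_per_url))
print(f"{full_crawl_raw / 1e12:.1f} TB")              # 2.8 TB
```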

------
libeclipse
I've tried switching to search engines other than Google numerous times, but
each time I've returned to Google simply because the results are better.
They're more accurate, more relevant, and I very rarely find myself searching
more than once to find something.

If commonsearch can beat Google in that regard, then count me in. But I doubt
it will.

~~~
sylvinus
Hi! I'm the founder of Common Search.

I don't think search result quality is on a linear scale so it's hard to
define "better".

The results will definitely be less personalized, which will be a big plus for
some people, and a blocker for others. There will be a few other dimensions
where we can stand out, and some where we will have a hard time catching up
(index size for instance).

In the end, given enough contributors, I'm pretty sure the results can get
"good enough" for most people, and hopefully "better" for some ;)

~~~
libeclipse
I think it's to do with the search engine's actual algorithm more than
personalisation. Even on a completely new computer, or while using Tor,
Google's results are pretty much always spot on.

Regardless, I'll be keeping an eye on CommonSearch.

~~~
mobiuscog
Confirmation bias, perhaps ?

------
whazor
I think people might underestimate the power of an open source search engine.
In my eyes it is like Wikipedia versus the old paper encyclopedias.
Improvements to search results in Google are made by a relatively small
number of people at Google. Google decides where you buy, what you think and
how you live. Behind their algorithms they have probably made dozens of
subjective choices. Public debate, more attention to detail, and open
politics are, as I see it, great tools to improve search engine quality.

~~~
praxulus
> Improvements to search results in Google are made by a relatively small
number of people at Google

How many open source projects log more engineering hours than Google's search
team? It's the flagship product of one of the largest corporations in the
world.

~~~
whazor
Wikipedia has 133,621 active registered users out of 27,755,916 registered
users, and 819,043,068 page edits in total. Google probably has better and
more engineers, but they rely on usage statistics, not experts in the
specific search domains.

------
jasode
I like the project's goal but as techies, we inevitably want to understand the
technical details and how it helps (or handicaps) the search results in
comparison with Google.

For example, the project's data sources[1] says that the bulk of data comes
from The Common Crawl. It looks like the CC is ~150 TB of data[2]. I'm not
familiar with google.com internals but various sources estimate that their
proprietary crawl dataset is more than a petabyte. (A googler could chime in
here with more accurate data.)

So it's not as simple as the _algorithm_ for Common Search being "more fair"
than the algorithm for Google Inc. The underlying dataset in terms of
quantity, recency, rules for the robot, etc all affect the algorithm.

This is not a criticism of the project. It is my attempt to understand what is
not obvious on the surface level.

[1] [https://about.commonsearch.org/data-sources](https://about.commonsearch.org/data-sources)

[2] [http://commoncrawl.org/2015/12/november-2015-crawl-archive-n...](http://commoncrawl.org/2015/12/november-2015-crawl-archive-now-available/)

(I can't tell whether each MM/YYYY archive is cumulative or an addendum.)

~~~
sylvinus
Hi! Data is indeed as important as the algorithm. Common Crawl is a very good
bootstrap but we will certainly need to go beyond once it proves to be the
limiting factor. We also hope we can help them improve their data set in the
short term by giving them a larger URL seed list.

------
mynewtb
Seeing that the founder is the same person who founded Jamendo, which was
later turned into a sad, user-unfriendly attempt to make money from freely
licensed music (destroying its community in the process), how can I trust
Common Search not to be a waste of time and attention?

~~~
sylvinus
So much anger!

I left Jamendo 6 years ago so I have limited influence on what they do now,
unfortunately.

Common Search is a nonprofit and 100% open source so it is fundamentally
different.

~~~
Fastidious
I do not sense any anger on the OP comment. It sounds like a legit, real
concern.

~~~
sylvinus
Well I actually share some of that anger so I'm sorry if I read too much into
"destroy" and "sad". Common Search is forkable by design so it should
hopefully stay on course one way or the other!

------
jdimov10
If it keeps being THAT fast after they've indexed the whole web, I'm switching
search providers! :)

------
rmc
I'm trying to find out from their website, but it's unclear. Are the servers
hosted in the USA? And will the organisation be incorporated in the USA?

If you're talking about privacy and transparency, it's better to operate in a
place bound by the European Charter of Fundamental Rights rather than the US
Constitution, because the former gives people _much_ more rights over their
data, how it's used, etc.

~~~
sylvinus
It is very early so we are not yet incorporated. The issues you mention will
definitely be taken into account!

At scale, we'd probably have multiple legal entities in different countries
anyway, like Wikimedia.

~~~
rmc
Thanks for your concern. However, I was under the impression that Wikimedia
is a US organisation with some local chapters. Perhaps incorporate in an EU
country from the start?

------
faizshah
I like it!

The explainer tool gives a really cool insight into the results:
[https://explain.commonsearch.org/](https://explain.commonsearch.org/)

------
struct
Neat, I was working on a project to give a full programmatic keyword index to
the contents of the common crawl, but I guess there's no need! It's very
exciting to consider what kind of applications you can build with this.

------
mrfusion
I'd love to see a Wikipedia styled search where people can improve or flag
results as they see fit. I wonder if that has been tried.

Sure, it might not handle the long, long tail, but the top ten million
searches would still be pretty useful.

~~~
deusu
I think Google once mentioned that each day a surprising number of searches
are unique, i.e. they have never been done before. If memory serves
correctly, that number was 30-40%.

~~~
mrfusion
Even if that's true, 60% is still pretty useful.

------
ocdtrekkie
This sounds awesome. Speaking of building AIs/bots and such in your FAQ, the
lack of a good open API for search is probably what gates that market to
Google, Microsoft and the like: nobody else can just tap into a search
engine. I'd love to be able to connect to this for queries at some point.

------
PaulHoule
"nonprofit" for me is a bad smell. I.e. the problem of sustainability, which
for nonprofits is all about the money and not about carbon or solar energy,
rainbows, plutonium or any of that.

------
tonylxc
I'm particularly interested in the discussion forum. Is it an open source
one, or did you build it yourselves? Thanks!

~~~
sylvinus
Oh no, one project at a time :)

We used [http://www.discourse.org/](http://www.discourse.org/), which I
recommend.

