
Wouldn't it be fun to build your own Google? - martinkl
http://radar.oreilly.com/2014/12/wouldnt-it-be-fun-to-build-your-own-google.html
======
Smerity
[lightly modified version of a comment I put on the article as I love HN for
discussion!]

Great article -- we're excited there's so much interest in the web as a
dataset! I'm part of the team at Common Crawl and thought I'd clarify some
points in the article.

The most important is that you can download all the data that Common Crawl
provides completely for free, without the need to pay S3 transfer fees or
process it only in an EC2 cluster. You don't even need to have an Amazon
account! Our crawl archive blog posts give full details for downloading[1].
The main challenge then is storing it, as the full dataset is really quite
large, but a number of universities have pulled down a significant portion
onto their local clusters.
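
For a sense of the format: the archives are shipped as gzipped WARC files, and each record starts with a small header block. A rough sketch of parsing that block (the record below is hand-built for illustration; real archive files are read the same way, record by record):

```python
def parse_warc_headers(raw):
    """Parse the header block of a single WARC record into a dict."""
    lines = raw.decode("utf-8").split("\r\n")
    version = lines[0]                 # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        if not line:
            break                      # blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers

# A tiny hand-built record header, just to exercise the parser:
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: 0\r\n"
          b"\r\n")

version, headers = parse_warc_headers(record)
```

In a real job you'd stream the gzipped archive and pull one record at a time, using Content-Length to skip to the next record boundary.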

Also, we're performing the crawl once a month now. The monthly crawl archives
are between 35 and 70 terabytes compressed. As such, we've actually crawled and
stored over a quarter petabyte compressed, or 1.3 petabytes uncompressed, so
far in 2014. (The archives go back to 2008.)

Comparing directly against the Internet Archive datasets is a bit like
comparing apples to oranges. They store images and other types of binary
content as well, whilst Common Crawl aims primarily for HTML, which compresses
better. Also, the numbers used for Internet Archive were for all of the crawls
they've done, and in our case the numbers were for a single month's crawl.

We're excited to see Martin use one of our crawl archives in his work --
seeing these experiments come to life is the best part of working at Common
Crawl! I can confirm that optimizations will help you lower that EC2 figure.
We can process a fairly intensive MR job over a standard crawl archive in an
afternoon for about $30. Big data on a small budget is a top priority for us!

[1]: [http://blog.commoncrawl.org/2014/11/october-2014-crawl-archive-available/](http://blog.commoncrawl.org/2014/11/october-2014-crawl-archive-available/)

~~~
curiously
What would be the benefit of having the web as a dataset when it is rife with
copyright restrictions (see Craigslist), monopoly businesses who viciously
protect their human-uploaded content and profiles, and when the majority of the
data on the web is useless without the context and purpose of the searcher?

I'm just curious as to how commoncrawl compares with kimonolabs and import.io
as they seem to have the same goal of creating an internet as a dataset, or an
API. I can't help but feel like it's just solving another 'semantic web'
problem that nobody asked for.

It is funny that the most demanding customers of the semantic web are also the
ones willing to spend the least time and money on it.

~~~
jobposter1234
Regarding copyright issues, I believe you can still use copyrighted data as
long as it's transformed. E.g., building language models, or doing a search
engine like Google. In fact, I can think of more computational uses for
copyrighted data, while on the "banned" side, I can only think of... SEO.

Regarding point two: "monopoly businesses who viciously protect their human
uploaded content". I spend a lot of time scraping these monopoly businesses,
and it seems to me they do a decent job of letting their users decide what
data is exposed. Facebook, LinkedIn, and Google are all decent about letting
me scrape their public info. That's all I have a right to -- private info
should stay private, at the behest of the owner (the User in UGC).

You are correct regarding the third point, but I don't see that as a problem.
This isn't a solution in search of a problem -- it's a problem without a
solution at the moment.

Here's a toy example of something I'd like to do: calculate the positive /
negative sentiment of commenters at particular baseball fan sites, so I can
hide the content I don't like, and show that which I do. Having a common crawl
of the site would be immensely useful (and is indeed a prereq) for this. I
wouldn't need to republish it, just compute on it.
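
A rough sketch of that toy example, with a made-up word lexicon and made-up comments standing in for text extracted from a crawl:

```python
import re

# Toy lexicon-based sentiment scorer for the fan-site use case above.
# The word lists and comments are invented for illustration; a real
# pipeline would extract comment text from the crawl archives first.
POSITIVE = {"great", "love", "clutch", "win", "amazing"}
NEGATIVE = {"terrible", "hate", "choke", "lose", "awful"}

def sentiment(comment):
    """Return (#positive - #negative) words, normalized by comment length."""
    words = re.findall(r"[a-z]+", comment.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / max(len(words), 1)

comments = [
    "What a clutch win, I love this team",
    "Terrible outing, they always choke in September",
]
# Hide negative-leaning comments, keep the rest.
visible = [c for c in comments if sentiment(c) >= 0]
```

The point is the shape of the computation: read-only analysis over a crawl, no republishing needed.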

------
pjbrunet
At first Google was a search algorithm, but at some point they decided to have
humans review and rank the important queries. Important as in query volume.

Why use humans? People can decide if your navigation is intuitive. They can
decide if your page looks like crap. If 230,000 people are searching for
"coconut oil" per month (actual numbers) then it's worth having an intern
spend 15 minutes to make sure page 1 of "coconut oil" looks right.

Google can afford that. They need a human to decide if the "user experience"
is actually good, versus disallowing the back button and forcing the browser to
crash, which is how I suppose you could fake a "time on site" metric if this
were just an algorithmic problem.

Google is now more like playing Zork. You type "Go North" like 10 million
other people before you typed "Go North", and Google has already crafted the
experience you'll find in the next room. (Which makes me wonder, do they score how
boring you are based on predictability?) This is becoming more and more
obvious over time as a search for "calculator" shows you an actual calculator
that a human at Google created. That's not an algorithmic response.

Similarly, I see that human touch coming more into play with voice
recognition, Google Glass, Siri, etc. Call that "AI" or whatever. You ask
Google a question and Google has already sculpted a slick answer based on tons
of testing. That's how I see Google as a search engine now. Part of the
crawling is interesting (recognizing objects in photos?) but I think human
reviews of all the important websites and SERPs, that's harder for a
competitor to reproduce.

~~~
Animats
_I think human reviews of all the important websites and SERPs, that's harder
for a competitor to reproduce._

Google was forced into that by improved "search engine optimization". SEO used
to be about things like keyword stuffing, but as Google made their search
engine smarter, SEO companies made their search spamming smarter. There are
now SEO operations using machine learning to reverse engineer Google's
algorithms and then automatically spam just enough to stay under the
threshold.

In 2010, Google tried using "local" data to improve search. That turned out to
be extremely easy to spam. A classic example of this can be found by searching
for "laptop repair bradford pa". This brings up "Illusory Laptop Repair",
located in the middle of a railroad crossing. A SEO expert created that phony
business listing to demonstrate how bad Google was at detecting such spam.
Google still thinks it's real.

In 2012, Google tried using "social" data to improve search. That worked even
worse. Fake Google accounts created to produce fake "+1"s may have exceeded the
number of real ones. Google "+1"s are still for sale; the going rate is about
$0.10 each.

Meanwhile, links aren't as useful as they used to be. Who creates a link to a
retail outlet other than on social media any more? Google is trying all sorts
of "signals", but in heavily spammed areas, they're not doing all that well.

Yandex has been trying search that doesn't weight links at all for some
heavily spammed categories in the Moscow area. It seems to be working for fake
real estate ads.

(We have a partial solution - find the real-world business behind the web site
and check it out in hard data sources, such as Dun and Bradstreet or Experian,
which have business credit data. See "sitetruth.com/doc".)

~~~
rjaco31
>A classic example of this can be found by searching for "laptop repair
bradford pa". This brings up "Illusory Laptop Repair", located in the middle
of a railroad crossing. A SEO expert created that phony business listing to
demonstrate how bad Google was at detecting such spam. Google still thinks
it's real.

This doesn't seem to work.

~~~
FeeTinesAMady
It does work on Google Maps, I find, but not on the main search page.

This is what I found:
[https://maps.google.com/maps?q=laptop+repair+bradford+pa&hl=...](https://maps.google.com/maps?q=laptop+repair+bradford+pa&hl=en&ll=41.955948,-78.643141&spn=0.0031,0.004823&sll=37.0625,-95.677068&sspn=53.477264,79.013672&hq=laptop+repair&hnear=Bradford,+McKean+County,+Pennsylvania&t=m&fll=41.955948,-78.643141&fspn=0.0031,0.004823&z=18)

~~~
Animats
If you're logged into Google, your results may vary.

------
mark_l_watson
A really nice idea.

I volunteered a bit early this year for Common Crawl (not much, just some Java
and Clojure examples for fetching and using the new archive format).

Common Crawl already has many volunteers (and a professional management and
technical staff) so it would seem like a good idea to merge some of the
author's goals with the existing Common Crawl organization. Perhaps more
frequent Common Crawl web fetches and also making the data available on Azure,
Google Cloud, etc. would satisfy the author's desire to have more immediacy
and have the data available from multiple sources.

~~~
mark_l_watson
some edits:

Most of the Common Crawl data is on Microsoft Azure, but not all of it.

The Common Crawl is a great resource that deserves attention from more
companies and developers.

------
JDDunn9
I've always wanted to experiment with my own search algorithm. Unfortunately,
I think this is still out of the budget of average programmers. Just the hard
drives to download 1.3 petabytes would cost six figures.[1][2]

[1] [https://www.backblaze.com/petabytes-on-a-budget-how-to-build-cheap-cloud-storage.html](https://www.backblaze.com/petabytes-on-a-budget-how-to-build-cheap-cloud-storage.html)

[2] [https://www.backblaze.com/blog/why-now-is-the-time-for-backblaze-to-build-a-270-tb-storage-pod/](https://www.backblaze.com/blog/why-now-is-the-time-for-backblaze-to-build-a-270-tb-storage-pod/)

------
smoyer
A couple thoughts:

1) I like the idea of human curation, but in combination with some sort of
automated crawler (or other tool) that helps in the browser.

2) Why can't we also distribute the act of crawling, the maintenance of the
index, and the map-reduce (or other algorithm) that produces the data?

I've been thinking about architectures that would allow (in essence) a P2P
search system. Would anyone be interested in talking about architectures to
make this work? There are millions of computers on the web at any given time
... if it's built into the browser (or plugs in), you could have human input
at the same time.

------
andrewhillman
Yeah, this sounds all well and good in theory, but after visiting thousands of
sites over the years, I think it might be a better idea to help engineers build
a search engine for their own site/data first. I can't recall many websites
that have amazing search. It's a problem when I have to use Google to find what
I want on xyz.com, because if I search for it on xyz.com directly I can't find
it even if I know it's on that site.

It would be so nice to go to xyz.com and actually find what I am looking for
in under 1 second.

~~~
arthurcolle
I'm pretty sure Elastic is on the right track in this regard, don't you
think?

~~~
andrewhillman
I don't think so. I just went to their site and the first case study I opened
was from theguardian.com. I went to theguardian.com and did a search. Guess
who they are using to power their search function? Google.

In my opinion, which means nothing, sites need to figure out how to power their
own search. Using a third party isn't going to work for most. Maybe people
need to focus on building custom architecture that indexes the data in a more
structured way, rather than cobbling systems together that ultimately hinder
search efforts when it's time to get the user what they want. I don't know the
answer, but somebody eventually will. Maybe WordPress will create a powerful
search for all those WordPress sites.

~~~
arthurcolle
Maybe I'm beating a dead horse, but I feel like you could create a pretty
compelling search engine using ES. I used it this summer at Goldman and it
seems like it will really change the landscape... It's just so fast. I mean,
full-text indexing seems to be a pretty integral part of search. Maybe it's
just the first step, with the second step being a well-written ranking
function, but that's just my thought.
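
To illustrate what full-text indexing buys you, here is a toy inverted index sketching the idea (not the actual ES/Lucene API; the ranking is a crude term-frequency count where a real engine would use tf-idf/BM25 and more):

```python
import re
from collections import defaultdict

index = defaultdict(set)   # term -> set of doc ids
docs = {}

def add_document(doc_id, text):
    """Tokenize the text and record which documents each term appears in."""
    docs[doc_id] = text
    for term in re.findall(r"\w+", text.lower()):
        index[term].add(doc_id)

def search(query):
    """Return doc ids containing ALL query terms, ranked by term frequency."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits, key=lambda d: -sum(
        docs[d].lower().count(t) for t in terms))

add_document(1, "Elasticsearch makes full-text search fast")
add_document(2, "Site search is often an afterthought")
```

The "well written ranking function" is exactly the second step: the index answers "which documents match" instantly; ranking decides what comes first.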

------
angersock
For anyone interested, there's a hilariously bitter and practical paper on the
trials and tribulations of building a search engine:

[http://queue.acm.org/detail.cfm?id=988407](http://queue.acm.org/detail.cfm?id=988407)

EDIT:

Article is clearly from an earlier era, but it's really cool to see how far
we've come and how much more computing power we have available now. There are
entire categories of problems that simply don't _exist_ anymore.

~~~
Animats
She later designed the search engine of Cuil. While Cuil failed, it only cost
them about $30 million to do most of what Google does.

It's surprising to me that there aren't search engines from Comcast, AT&T, and
Apple. If you have customers, why give up all that ad revenue to Google?
Google is paying some big players a lot of money not to do that. They were
paying Apple $1 billion a year to be the default on Apple products. Apple
switched from Google to Bing anyway.

~~~
desdiv
_While Cuil failed, it only cost them about $30 million to do most of what
Google does._

They raised ~$30 million in two rounds, but their _valuation_ was at $200
million by round two. I agree with your point though; the cost to develop a
good search engine is dirt cheap compared to the value it brings.

~~~
Animats
"Valuation" by whom? They had no revenue, no revenue model, no VC would give
them additional funding, and Google didn't buy them out. On September 17, 2010
at 1 PM, all the employees were told the company was shutting down.

Google did hire the CEO and Anna Patterson to keep them from doing another
search engine.

------
ryanthejuggler
This would be really cool to participate in, especially if it could be
packaged à la Folding@Home/SETI@Home and widely distributed. I wonder if
there's some clever method using crypto that can provably discourage bad
actors if the network has certain properties (e.g. Bitcoin is nearly
impossible to cheat unless one group owns >50% of the network).

------
discardorama
Google's power comes not from the crawling, but from the retrieval and
ranking. They use many more signals than the hyperlinks and anchor text (which
is all you'd have if you crawled yourself). Indexing crawled content would
have been OK in the year 2000; but today, the users demand more. Relevance is
the top priority, and no one does it better than El Goog.

~~~
threeseed
Sorry but Google's ranking algorithm for me is far from brilliant.

To give you an example, search for "webhcat primary key" (without quotes) and
note how the top three search results do not actually contain the term
webhcat. Google constantly does this. It randomly ignores search terms unless
you explicitly quote them.

I believe that there is still a market for a technical/advanced search engine.

~~~
jobposter1234
Isn't google doing that because it detected the semantic information was on
the page, even if the exact term wasn't? Is your issue with the fact that
they're doing more than just a keyword retrieval, or is your issue with the
fact that they're doing it poorly?

~~~
threeseed
Isn't my issue obvious? I wanted search results that contained the search
terms. Otherwise I wouldn't have entered them in the first place.

I understand Google is trying to be clever here and appealing to novices who
don't really understand what they want.

But my point is that for those of us who do, it is an incredibly annoying
"feature". Feature is in quotes because, in the specific case above, they didn't
find semantic equivalents. They just dropped the "webhcat" term entirely.

~~~
jobposter1234
>Isn't my issue obvious?

Not touching that with a 10 foot pole... (just teasing, just teasing...)

But seriously, it sounds like what you want is a keyword-matching engine.
Google, for better or worse, has decided they know enough about their users'
searches that they don't mind modifying the query in an attempt to
retrieve what people want rather than what they literally say they want.

I understand that you don't feel they're serving your needs any longer, and
that can be frustrating. I think, however, that you can precede mandatory
terms with a + sign to require them to be present on the page.

~~~
ahpeeyem
After Google Plus was released, Google changed its search syntax slightly so
that you now need to "quote" mandatory terms instead of prefixing them with a
+ symbol.

[http://waxy.org/2011/10/google_kills_its_other_plus/](http://waxy.org/2011/10/google_kills_its_other_plus/)
[https://productforums.google.com/forum/#!topic/websearch/3oI...](https://productforums.google.com/forum/#!topic/websearch/3oIWbew9xdE%5B1-25-false%5D)

~~~
cogburnd02
This is really weird, because now I have to enter this:

    "singer" "song" -inurl:(htm|html|php) intitle:"index of" "last modified"
    "parent directory" "description" "size" "m4a"|"mp3"|"ogg"|"flac"

to search for songs. _Too many quotation marks!_

original, unmodified: [http://lifehacker.com/207672/turn-google-into-your-own-personal-free-napster](http://lifehacker.com/207672/turn-google-into-your-own-personal-free-napster)

------
swah
Maybe more people should start crawling and seeing what they can extract? I
remember seeing DuckDuckGo Instant Answers and thinking what a valuable
resource that would be (having a database like the one DDG must have, I mean).

Then one would be able to do some "stuff Google can do" - say, analysing
trends - albeit with worse sampling, and not depend that much on them.

------
sparkzilla
The problem with algorithmic/scraper search methods is that they only work
with existing data. For example, most Google searches give a list of websites
on one side and some data scraped from Wikipedia on the other. There is not
much meaning there. That's because Google's algorithm cannot combine the
results into something original; that would require human creativity.
As such, I see the rise of different kinds of search based on what humans
create, rather than what computers can scrape. I wrote a (longish) blog post
on this problem: [http://newslines.org/blog/googles-black-hole/](http://newslines.org/blog/googles-black-hole/)

~~~
minthd
You really discount the value of the long tail of search, which is where you
get the best info from Google.

~~~
sparkzilla
You may get the best (more precise) info in the long tail, but I'm pretty sure
that Google makes most of its money from the most popular searches.

------
dmritard96
Surprised not to see a mention of a bloom filter in URL dedupe. Another tough
problem now is the portion of the web that's in walled gardens or that is
expensive to crawl (needs a JS context).
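
For reference, a bloom filter for URL dedupe is only a few lines; the sizes below are toy values, and a real crawler would pick m and k from the expected URL count and the false-positive rate it can tolerate:

```python
import hashlib

class BloomFilter:
    """Fixed bit array plus k hash functions derived from sha256 with salts."""

    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, url):
        # Derive k independent-ish bit positions from salted sha256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        # May return a false positive, never a false negative --
        # exactly the trade-off you want for "have I crawled this URL?"
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))

seen = BloomFilter()
seen.add("http://example.com/a")
```

The memory win is the point: tracking billions of URLs exactly is expensive, while a bloom filter spends a handful of bits per URL.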

------
mjklin
I thought Wikimedia tried this once. Big announcement, then nothing. Is that
code still available?

~~~
Arkanosis
That was Wikia, not Wikimedia, but yes, the code is still available:

- crawler: [http://sourceforge.net/projects/grub/](http://sourceforge.net/projects/grub/)
- search engine: [http://nutch.apache.org/](http://nutch.apache.org/)

~~~
runarb
If my memory serves me correctly, only the client part of grub is open
source. Without the server part, one cannot use it to set up one's own
crawl.

------
thewarrior
Hmmm I'd think that ChuckMcM would have some interesting views about this.

------
imranq
What about Algolia? HN uses it.

------
smartpants
Good Read

