

Cuil Crawl Data: 310 terabytes of compressed data, snapshot from 2007-8 - aw3c2
https://archive.org/details/cuilcrawl

======
fhars
He gives you a hamburger, but it turns out you don't actually exist. He gives
you terabytes of data. Something good comes out of it in the end.

~~~
alex_c
For those wondering:

[http://www.reddit.com/r/worldnews/comments/7da5i/police_raid...](http://www.reddit.com/r/worldnews/comments/7da5i/police_raids_reveal_baby_farms/c06cqxb)

~~~
daurnimator
care to explain/give further context? Even looking at that whole reddit thread
I don't understand.

~~~
yathern
From what I can gather, Cuil was a sort of search engine that created
summaries of pages using some algorithm. This algorithm often spliced together
strange sentences without context. If you look at the whole thread, it seems
that the person before the linked comment makes an analogy between Reddit's
algorithm to determine thumbnails for a given link, and Cuil's algorithm to
generate summaries.

Then the user who was linked goes on to describe a technique that uses the
term "Cuil" as a unit of measurement for how disjointed something is from
something else.

~~~
ArcticCelt
Also, for more context: that was not long after the initial launch of Cuil,
and the feeling on reddit at the time was that Cuil search results were
pretty bizarre. So reddit did what it does best and made a meme of the whole
thing.

------
bluedino
I'd love to hear the story behind getting the data to the Internet Archive,
from Cuil. 310TB is a massive amount of data, and was even more massive in
2007. The biggest hard drives you could buy were only 500-750GB. If they were
SAS drives they were probably only 300GB at the biggest. Did they just throw
6-8 racks worth of a SAN onto a truck and ship it over? What did they
(Archive.org) do with it once they got it?

~~~
adventured
Here are some photos of their hardware (Business Insider claimed their local
setup held 7 petabytes, in an article circa Nov 2009).

[http://static.businessinsider.com/image/4af8564f0000000000e6...](http://static.businessinsider.com/image/4af8564f0000000000e660d4-590/their-local-data-center-holds-7-petabytes-of-data-heating-is-a-big-issue.jpg)

[http://static.businessinsider.com/image/4af84e320000000000b7...](http://static.businessinsider.com/image/4af84e320000000000b72feb-590/server-cabinets-that-have-been-built-but-not-yet-deployed.jpg)

[http://static4.businessinsider.com/image/4af880fe00000000009...](http://static4.businessinsider.com/image/4af880fe00000000009cd72c-590/cuil-gets-about-five-complaints-a-day-from-webmasters-worldwide.jpg)

[http://www.businessinsider.com/cuil-office-tour-2009-11#cuil...](http://www.businessinsider.com/cuil-office-tour-2009-11#cuil-is-currently-headquartered-in-menlo-park-in-an-building-once-occupied-by-tshirt-company-zazzle-1)

------
hosay123
So for about $12,000 in bandwidth fees, $17,000 in 4TB SATA drives, and say
another $15,000 on 10 2U hosts, you could have a slightly stinky copy of the
Internet that would fit in a half-height rack.

That's freaking awesome.
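For the curious, the arithmetic behind that rack works out roughly like this
(a sketch in Python; the per-drive price is just what the $17,000 figure
implies, not a quoted price):

```python
import math

DATA_TB = 310           # compressed crawl size from the post title
DRIVE_TB = 4            # 4TB SATA drives, as above
DRIVE_BUDGET = 17_000   # dollars, figure from the comment
HOSTS = 10              # 10 2U hosts, as above

drives_needed = math.ceil(DATA_TB / DRIVE_TB)       # 78 drives, no redundancy
implied_price = DRIVE_BUDGET / drives_needed        # ~$218 per drive
drives_per_host = math.ceil(drives_needed / HOSTS)  # 8 drives per 2U box

print(drives_needed, round(implied_price), drives_per_host)  # 78 218 8
```

That's without any RAID or replication, so a practical build would need a few
more drives than the bare minimum.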

~~~
deletes
Even more interesting: in the future, will the space that represents the
entire Internet (the rack) keep growing, or will disk storage improve faster
than the Internet grows, so that you'll eventually be able to fit the Internet
on a small handheld disk?

~~~
adventured
Media alone will vastly outrun small-device storage capabilities over any
relevant length of time we can project (5, 10, 20 years). It might be an
interesting thought experiment for text alone (and probably for only a
fraction of a fraction of that, given social media).

YouTube alone will guarantee you could never store the Internet on a small
hand held disk. They're adding 72+ hours of video to the service every minute.
And ten years from now, it'll all be HD+ content, and they'll probably be
adding 500 hours per minute or something similarly crazy.

I think Wikipedia is around 42gb right now, uncompressed (just the content
pages). I don't think that includes the images. So right now we're just to the
point where you can store a text Wikipedia on your smart phone with a $30 or
$40 sd card. I'd guess in eight to ten years we'll have 500gb to 1tb smart
phone equivalents depending on how that all evolves. You might be able to
store a dozen plus dual layer full blu ray discs on your smart phone in a
decade.

In other words, it's easy to say that we're not going to get anywhere near
storing even a sliver of the Web / Internet on a small device in our
lifetimes.
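To put the YouTube figure in perspective, here's a rough back-of-envelope
calculation (a sketch; the 72 hours/minute comes from the comment above, while
the 5 Mbps average stored bitrate is purely an assumption for illustration):

```python
UPLOAD_HOURS_PER_MIN = 72   # figure cited above
BITRATE_MBPS = 5            # assumed average stored bitrate (illustrative)

hours_per_year = UPLOAD_HOURS_PER_MIN * 60 * 24 * 365   # ~37.8M hours/year
gb_per_hour = BITRATE_MBPS * 1e6 * 3600 / 8 / 1e9       # 2.25 GB per hour
pb_per_year = hours_per_year * gb_per_hour / 1e6

print(f"~{pb_per_year:.0f} PB of new video per year")   # ~85 PB/year
```

Even at that conservative bitrate, a single year of uploads dwarfs the entire
310TB Cuil crawl by a couple of orders of magnitude.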

------
jameskegel
I can barely remember Cuil, but what I do remember is it being touted as a
"Google Killer". Funny how that worked out; I wonder where they went wrong. Oh
that's right, I searched for Fedora Linux and got pointed to a menswear
retailer in the south of France.

~~~
treerex
They touted _themselves_ as a "Google Killer". They basically came out of the
gate claiming to be the best search engine in the world, and then they went
down in spectacular flames. Whoever developed their launch PR should have been
put up against a wall and shot.

If they had rolled out their technology with less fanfare they _may_ have made
a minor dent in the market, but even then it's unlikely. But at least then they
could have taken their IP and gone into the Enterprise Search space and
perhaps have gotten bought out by someone evil.

~~~
sp332
The press picked them up before they were actually ready to launch. Their
search model was to segregate results into different categories and show all
the categories at once, in sections of the page. When the traffic spiked, the
most popular categories went offline, leaving only the long-tail sections
active. Most people never saw how the search would have looked when the
software was finished and the servers beefed up.

~~~
ashrust
This is false.

At cuil we were told by mgmt of the launch date and our very talented PR team
kicked into gear and generated a ton of buzz via a nationwide press tour.

Despite popular opinion, the ops team kept the site up during the massive
traffic surge. A combination of poor mgmt, initial deference to user testing,
last minute commits and a nasty indexing bug were the reasons the relevance
sucked on day 1.

------
NelsonMinar
Anna Patterson, one of the Cuil founders, previously worked at archive.org and
built an early search engine over their archive (named Recall, IIRC).
According to her LinkedIn profile she currently works at Google.

~~~
ashrust
It wasn't just Anna, there were many former archive staffers at cuil. I
suspect that's how the data ended up being transferred.

------
gnu8
I can't wait to check this out in my Flock web browser!

------
xhrpost
Is the actual data available for download yet? I'm just seeing hundreds of
meta-data XML files.

~~~
Mithrandir
I don't think so. If you go to one of the items (e.g.
[https://ia601408.us.archive.org/18/items/cuil-domainshard-co...](https://ia601408.us.archive.org/18/items/cuil-domainshard-corpus5-large-merge-rev1.00355-of-25000/))
and try downloading one of the .arc.gz files, you'll get a message saying "The
item is not available due to issues with the item's content." I've seen this
happen before with other crawls, like for Youtube
([https://ia600301.us.archive.org/26/items/IA-YOUTUBE-000-2007...](https://ia600301.us.archive.org/26/items/IA-YOUTUBE-000-20070509025513-45711-crawling02.us.archive.org/)).

It could be a copyright issue or to keep bandwidth low, but I don't know for
sure.

------
jbellis
Will this be integrated with the Wayback Machine?

~~~
vijayr
How effective will it be? It's 5 years old.

~~~
mjn
The Wayback Machine archives old snapshots of pages as they appeared at
various times. If the Cuil crawler has snapshots from 2007-08 of pages that
aren't already in the Wayback Machine's own snapshots from that time (the
Internet Archive crawler doesn't manage to get to everything), they could be
used to supplement the archive.

------
taylorbuley
Is this a product of Cuil's infamous Twiceler bot?

------
bediger4000
Has the NSA downloaded it yet? That would be a good start to the Utah Data
Center's contents.

~~~
wmf
The way they're spliced in means they don't have to crawl anything. When you
surf the Web you're crawling for them.

