
Greplin: 1.5 Billion Documents Indexed, Six Engineers - jamesjyu
http://techcrunch.com/2011/04/27/greplin-1-5-billion-documents-indexed-six-engineers/
======
pw
Makes me think of this Quora question: Is it true that the size of the portion
of the web that Google indexes is actually smaller than the sum of the sizes
of the contents of everyone's Gmail?
(<http://www.quora.com/Is-it-true-that-size-of-the-portion-of-the-web-that-Google-indexes-is-actually-smaller-than-sum-of-sizes-of-the-contents-of-everyones-Gmail>)

Is it fair to say that the size of the "private" web (what Greplin aims to
index) is, in aggregate, larger than the public web? And are there any amazing
things that become possible once you've indexed a large portion of that
private web?

~~~
aik
Good question. As I understand it, since Greplin has access to all this
private information, they have an incredible amount of power -- in fact more
than Google, which only holds one or two aspects of it (albeit large ones).
Just for the sake of privacy, imagine Greplin agreeing to give up private user
information to the gov't, just like all these other companies. They'd have
access to everything.

Scares me a bit too much to sign up for the convenience.

~~~
alnayyir
Want one you can deploy privately?

~~~
aik
Yes very much so.

~~~
jamii
Ahem - <https://github.com/quartzjer/Locker>

~~~
alnayyir
This gets mentally filed under the same category as, "you should be checking
to see if a library does it for you before you start coding".

------
Tibbes
I'm guessing that: a "document" on Twitter is a single tweet; a "document" on
Facebook is a wall-post or equivalent; a "document" on GMail is an e-mail; a
"document" on Google Calendar is an appointment.

Therefore, the comparison with Google’s web-wide index in 2001 is a little
misleading (in terms of the amount of data), given that the average size of a
web-page is greater than all of these.

Of course, the average size of a file on Dropbox is likely to be larger than a
webpage. I wonder what percentage of those 1.5 billion documents are files on
Dropbox.

~~~
tsycho
greplin doesn't index content within files on dropbox, just the filenames.

I am building a startup that does that, i.e. it indexes your doc/pdf files
(more formats coming) and lets you instantly search through them. It's called
grepfiles.com, but it's at a very early stage (pre-alpha), so go easy on it
since I am not sure how well it scales. Mail me at mail@asif.in if you have
any feedback. Would really appreciate it.
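Full-text search over files like this usually boils down to an inverted index
mapping terms to the documents that contain them. A toy pure-Python sketch of
the idea (the file names and class are illustrative, not grepfiles' actual
implementation):

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase word tokens; real indexers also stem and drop stopwords."""
    return re.findall(r"[a-z0-9]+", text.lower())

class InvertedIndex:
    def __init__(self):
        # term -> set of doc ids containing that term
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return doc ids containing every query term (AND semantics)."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for term in terms[1:]:
            result &= self.postings[term]
        return result

ix = InvertedIndex()
ix.add("report.pdf", "Quarterly revenue report for 2011")
ix.add("notes.doc", "Meeting notes about revenue projections")
print(ix.search("revenue report"))  # {'report.pdf'}
```

Search engines like Lucene add ranking, positional data, and on-disk postings
on top of this same basic structure.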

------
rakkhi
What does the HN community think of the greplin concept? They have recently
added a Chrome plugin and a greplin search replacement for standard email
search.

Think it is a public beta and anyone can signup if not ping me and I'll send
you an invite.

My main concerns with the service are:

\+ Centralized risk - keys to a very valuable kingdom

\+ No two-factor - but they tell me it's coming

\+ No word on whether they encrypt in storage - although it should only be an
index to the information rather than the actual info

\+ Standard SAAS / Cloud risks - internal abuse, legal turnover etc.

Any others? All of these could be mitigated to a reasonable degree. What do
you think? Is there a future for this type of service (or big buyout for
Google / Bing) or is it just too scary?

~~~
oscilloscope
It's an incredible amount of personal data. If all that data was collected,
then abused, I'd dissociate from much of my identity. I would just feel
totally alienated by post-industrial society.

I'd be okay using Greplin if I knew Google was going to acquire them. I trust
Google. I figure when Google goes bad, there will be much bigger issues facing
humanity and our internet pasts will all be damning anyways.

~~~
jwr
> I'd be okay using Greplin if I knew Google was going to acquire them. I
> trust Google. I figure when Google goes bad, there will be much bigger
> issues facing humanity and our internet pasts will all be damning anyways.

I think it is a sign of the times that I read this paragraph, thought it was a
subtle joke, reread it and decided it was serious, and then did some more
thinking about whether the author is serious or not.

------
webmonkeyuk
1.5B docs by just six people is impressive but I suspect that computers did a
bunch of the indexing work.

~~~
itgoon
Ha!

Just the logistics of _handling_ 1.5B docs would keep six people pretty damn
busy.

------
aik
I would get seriously excited about this if I could install it on my own
server and keep my own index. I'm a bit hesitant to give them access to all
the data in all my accounts in exchange for a small convenience.

It is pretty impressive, though saying that it launched in February is
misleading. I signed up last year, ran into a bunch of problems with it not
indexing anything, and haven't opened it since. Now it looks like everything
actually has been indexed, which is cool. I'm deleting my account for now
though, as it doesn't yet seem easy enough to be useful for my purposes.

~~~
lehmannro
I regularly hear that _if I could install it on my own server_ argument and
wonder if you think you can handle security and administration much better
than someone who's paid to do it. I, for one, can't and would not want to
waste my time on it.

~~~
aik
I agree that is a good point. Perhaps it is a technology that would be better
off not existing?

Security aside, one of my fears isn't necessarily hackers, but legal entities
making use of the private information illegally, in addition to Greplin
selling "me" in a very compact and precise manner to whoever they want.

------
g123g
Big Deal?

With cloud providers like Amazon providing computing power on a pay-as-you-go
basis, I am not sure why this is news nowadays.

Some ridiculous comparisons are thrown about in the article -

same size as Google’s web-wide index in 2001

60 times the size of Google’s original 1998 index

I am not sure how to process and make sense of these comparisons.

~~~
mlinsey
It's a big deal because:

(a) It's a proxy for traction. Greplin indexes data that can't be crawled;
users have to authorize it to index their data. So aside from how hard of an
engineering feat it is, the fact that they've indexed this much data probably
means that they have a sizable number of users.

(b) While you're right that the technical challenge of indexing that many
documents is easier now than in 2001 thanks to things like AWS (and numerous
open source projects), to do it with a team of six is still impressive.

~~~
g123g
My gripe was with the article and not with what greplin is doing. Details like
what you have mentioned in point (a) would have made the article much more
useful, rather than multiplying some random number from 1998 and expecting the
readers to have a wow moment. Some idea about how they actually index the
items, how they store this massive data, how the search is kept fast, etc. is
what I would have liked to read.

~~~
catshirt
this is a press release to techcrunch

they have an engineering blog as well <http://tech.blog.greplin.com/>

------
dacort
While I understand that real-time full-text indexing is a much more difficult
problem to solve, I've got just under 1.5 billion tweets "indexed" in
TweetStats. And I'm one person.

Granted, given the 30MM/day number they must be growing that index very
quickly, and they've likely hit that 1.5B mark pretty darn quickly.

~~~
moe
_real-time full-text indexing is a much more difficult problem to solve_

Solve?

Greplin has probably not built their own search technology. I'd guess they're
simply running Lucene or Sphinx like everyone else.

Their index is still small by search standards, as you can tell from
TechCrunch having to reach 10 years back to make an "impressive" analogy.

Today, 1.5 billion documents translates to a couple of terabytes of data
(probably high single digits). 30 million indexed/day translates to roughly
350/sec. You could store and process all that on a single, beefy box. Or you
could spread it out over a couple of Amazon instances.

But yes, in 2001 this would have been impressive. In 2001 you'd pay $150 for a
40 GB harddrive...
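The back-of-envelope numbers above are easy to sanity-check. The per-document
size here (5 KB) is an assumption for illustration, not a Greplin figure:

```python
# Rough check of total index size and ingest rate from the figures above.
DOCS = 1_500_000_000        # total documents indexed
DOCS_PER_DAY = 30_000_000   # reported daily indexing volume
AVG_DOC_BYTES = 5 * 1024    # assumed average indexed-document size

index_tb = DOCS * AVG_DOC_BYTES / 1024**4   # bytes -> tebibytes
docs_per_sec = DOCS_PER_DAY / 86_400        # seconds in a day

print(f"approx index size: {index_tb:.1f} TB")      # ~7.0 TB at 5 KB/doc
print(f"ingest rate: {docs_per_sec:.0f} docs/sec")  # ~347 docs/sec
```

At that assumed document size you land in the high single-digit terabytes,
consistent with the estimate above.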

------
ww520
While Greplin is impressive, it's not on the same scale as Google, even in its
early days. Google built one large global index for everyone, while Greplin
builds many small indices, one per user. A little calculation illustrates the
point.

Google's global index: 1 billion documents, searchable by 1 million users. It
needs to support 1B x 1M search capacity.

Greplin's individual indices: 1,000 documents per user in each individual
index. With 1 million users that's 1B documents total, but each user only
searches his own 1K index. It only needs to support 1K x 1M search capacity.

It's orders of magnitude difference.
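The comparison above, spelled out (the user and document counts are the
comment's illustrative figures, not real stats for either company):

```python
# "Documents scanned per query x number of users" as a crude capacity proxy.
GOOGLE_DOCS = 1_000_000_000   # one shared global index
USERS = 1_000_000
DOCS_PER_USER = 1_000         # many small per-user indices

google_capacity = GOOGLE_DOCS * USERS     # every user searches all 1B docs
greplin_capacity = DOCS_PER_USER * USERS  # each user searches only 1K docs

print(google_capacity // greplin_capacity)  # 1000000, i.e. six orders of magnitude
```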

------
B-Scan
250M documents per engineer. Not bad at all.

------
gubatron
I wonder if they use Solr to distribute their ever growing index.

~~~
sigil
Lucene, I believe.

<http://news.ycombinator.com/item?id=2443675>

------
lennexz
At 19, this young man is already doing big things. I haven't tried Greplin yet
but I think it has a very bright future.

~~~
mindotus
Agreed and very impressive indeed.

I'm sure we'll be hearing much more from these guys.

