

Full MongoDB database dump of the Blippex search engine - karli
http://blippex.github.io/updates/2013/07/23/First-database-dump.html

======
mgamache
So it's a bunch of internet URLs without content or content metadata?

~~~
bkanber
Seems that way. I mean, that'd still be valuable and interesting for other
reasons, but let's not call it a "dump of a search engine". There's nothing in
there that's actually searchable!

Still, nice gesture by Blippex. Somebody will find something interesting to do
with this, even if they just use it educationally.

------
powertower
> Last friday we reached the milestone of 50k searches per day. Today we are
> releasing as promised the first dump of our database.

I hope whatever they are releasing is not the user search query data.

Remember when AOL did that?

http://techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/

edit - okay, I think it's the index data.

------
rgiar
so this is just when a given site was crawled?

    
    
      "_id": "b919f02c8f053c41e8ee86311ca9b0f6",
      "url": "https://www.example.com/",
      "host": "www.example.com",
      "root": "example.com",
      "time_spent": [
        {
          "sec": 45,
          "seen_at": ISODate("2013-06-23T00:41:44.0Z")
        },
        {
          "sec": 5,
          "seen_at": ISODate("2013-07-01T14:41:44.0Z")
        }

~~~
karli
Hi,

Yes, as the blog post says, the only thing missing is the full text of
the page for indexing and searching. We don't dare release it because
of copyright issues ("hey, you're distributing the full text of my page!").

With this data you could, for example, build a new Alexa and find out what was
the most visited page last week :)
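
A rough sketch of that idea in Python, assuming the documents have already been
loaded as dicts shaped like the sample above, that each `time_spent` entry
stands for one visit, and that `seen_at` has been parsed into a timezone-aware
`datetime`. The function name `most_visited` and the sample records are
hypothetical, not part of the dump:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def most_visited(docs, now, days=7, top=10):
    """Rank URLs by number of time_spent entries seen in the last `days` days."""
    cutoff = now - timedelta(days=days)
    visits = Counter()
    for doc in docs:
        for entry in doc.get("time_spent", []):
            # Assumption: one time_spent entry == one visit to the URL.
            if cutoff <= entry["seen_at"] <= now:
                visits[doc["url"]] += 1
    return visits.most_common(top)


# Hypothetical records mirroring the structure of the sample document.
docs = [
    {
        "url": "https://www.example.com/",
        "time_spent": [
            {"sec": 45, "seen_at": datetime(2013, 6, 23, 0, 41, 44, tzinfo=timezone.utc)},
            {"sec": 5, "seen_at": datetime(2013, 7, 1, 14, 41, 44, tzinfo=timezone.utc)},
        ],
    },
    {
        "url": "https://www.example.org/",
        "time_spent": [
            {"sec": 12, "seen_at": datetime(2013, 6, 30, 9, 0, 0, tzinfo=timezone.utc)},
            {"sec": 30, "seen_at": datetime(2013, 7, 1, 18, 30, 0, tzinfo=timezone.utc)},
        ],
    },
]

now = datetime(2013, 7, 2, tzinfo=timezone.utc)
print(most_visited(docs, now))
```

Summing `sec` instead of counting entries would rank pages by time spent
rather than by number of visits.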

~~~
enigmo
How would we figure out which page was visited the most last week? Are these
crawl logs or access logs?

~~~
mgamache
It would be crazy cool to get a real-time feed of browsed URLs (not this dump
format). Kind of like the mythical Twitter fire-hose.

~~~
geraldbaeck
Nice idea, I think we should do that.

Gerald, CTO, Blippex

------
itsmeduncan
This will be fun as a list of places to try out 0-day exploits on.

