SeenBefore: A search engine for what you have seen before (seenbefore.com)
175 points by chrishan on Sept 24, 2012 | 95 comments

Obvious point to raise: the reason people regularly delete their browser history is because they watch porn without turning on private browsing. How do you propose to deal with this?

You'd need to provide at least the ability to selectively delete portions of the history. But you can selectively delete portions of your browser history too, and people don't - because it would be too easy to miss something. Instead, they just nuke the whole thing. How is your tool different?

I take advantage of my browser's history with porn. When I was in college, the first bash script I wrote was to open the movie in my porn collection that I hadn't watched for the longest time. This was great. But now, with streaming porn sites, I don't have a huge collection, and I often watch scenes that I can't seem to find later. There is a lot of porn out there.

Sure, people clear their browser history because they're embarrassed by their porn obsession, but I think this tool could be very useful for pornaholics too.

You could do something similar today—just hook into your browser's history API

Vinny Glennon, one of the founders here. Thanks very much for the upvotes. The Chrome extension does not work in private browsing. I have a set of porn sites (1.7 million, stored in redis) that I check incoming links against. You can selectively block sites (https://www.seenbefore.com/blacklist_items).

Where did you get your list from, and can you share it?

For research purposes.

For science.

Do you store the entire set of 1.7 million entries in redis? Or is redis an index to data stored elsewhere, in a relational DB perhaps?

I was under the impression that redis wouldn't be all that useful to store a lot of data. Would be great if something as quick as redis could work with large data sets.

Storing a list of 1.7 million strings takes us 70 MB of memory. Testing for membership is an O(1) op. Very happy with it. We use mongo as a dumb data store, as well as a bunch of other infrastructure tools, like http://circleci.com, that we could only have dreamt of years ago.

> Testing for membership is an O(1) op

Curious how. O(1) is an array index lookup, not a string lookup, I thought.

Again just curious, but I'd still like to know how. Someone there asks the mod how it could be O(1), the mod replies it's a "hash table lookup". But Wikipedia at http://en.wikipedia.org/wiki/Big_O_notation suggests that such lookup is no faster than O(log log n). I think the redis info is incorrect.

O(1) implies that the location of the member in the list is already known, with no search required. I don't see how that could be the case when it's a key lookup. The key could be anywhere in the list, even if the list is sorted. The key would have to be searched for, it seems.

O(1) means constant time. Redis sets are interesting, because there are a few possible implementations under the hood, but the typical case winds up implementing it as a hash table. Hash tables have constant time lookup.

Think of it this way (this isn't literally what happens, but it's close).

1) Take the url you're looking for. Run it through a hash function. This takes an (amortized) constant amount of time.

2) Now you have an index to check. (the return value from the hash function). So index into your actual table, and check to see what's stored there. If there's a value stored there, then the url is a member of the set. This also takes a constant amount of time.
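A minimal sketch of those two steps, in Python. This is a toy, not Redis's actual code: the class name, the bucket count, and the choice to resolve collisions by keeping a short list per bucket (separate chaining) are all assumptions made for illustration.

```python
class TinyHashSet:
    """Toy hash set with separate chaining (illustrative, not Redis's code)."""

    def __init__(self, num_buckets=1024):
        # Each bucket holds the few keys that happen to hash to the same index.
        self._buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Step 1: hash the key and reduce the hash to a bucket index.
        return hash(key) % len(self._buckets)

    def add(self, key):
        bucket = self._buckets[self._index(key)]
        if key not in bucket:
            bucket.append(key)

    def __contains__(self, key):
        # Step 2: look only in the one bucket the hash points at.
        return key in self._buckets[self._index(key)]


blocked = TinyHashSet()
blocked.add("http://example.com/a")
blocked.add("http://example.org/b")

print("http://example.com/a" in blocked)  # True
print("http://example.net/c" in blocked)  # False
```

The lookup never scans the whole table. It inspects one bucket, and as long as the table is kept large relative to the number of keys, that bucket stays short on average no matter how many keys are stored.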

Does this help?

> So index into your actual table, and check to see what's stored there

But that check isn't a constant time lookup. The lookup time can vary. (Analogously, a lookup in a phone book can vary in time; we can't necessarily go to the exact spot the first time.) So the total time for both steps must vary as well. I think.

I think where you're getting confused is that you're conceptualizing this like a search problem, where you compare values and inspect each member to see if it matches the target.

That's not what's going on here. Instead, you use the value as an input into a function that tells you where to look for it, then you look, to see if it's there.

If it's not there, it won't be anywhere else, so you don't have to keep looking. Things get interesting with collisions but that's a subject for another time.

OK, thanks, I'm starting to understand. Good explanation. I quit being lazy and searched around too, to see that it's possible.

Why not use a bloom filter?

The main benefit of Bloom filters is that they can be made small. Given that his database takes only 70MB or so and he's not trying to ship this to devices that might have much in terms of space limitations, there would appear to be little point.

Eh, true, I guess redis is sufficiently awesome.

Maybe because of this fact (according to Wikipedia)?: "The more elements that are added to the set, the larger the probability of false positives."

That depends on its size, though. You can make it larger and get fewer false positives.
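To make the size trade-off concrete, here is a rough Bloom filter sketch in Python. Everything in it is an assumption for illustration: the class, the salted SHA-256 trick for deriving bit positions, the four hash functions, and the made-up example URLs. It inserts the same 1,000 items into a small and a large filter and measures how often items that were never inserted are wrongly reported as present.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit array plus k salted hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every one of the k bits is set; may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def false_positive_rate(num_bits, num_items=1000, num_probes=5000, num_hashes=4):
    bf = BloomFilter(num_bits, num_hashes)
    for i in range(num_items):
        bf.add(f"http://in-set.example/{i}")
    # Probe with items that were never added; every hit is a false positive.
    hits = sum(f"http://not-in-set.example/{i}" in bf for i in range(num_probes))
    return hits / num_probes


small = false_positive_rate(num_bits=4_000)
large = false_positive_rate(num_bits=40_000)
print(f"4,000 bits:  {small:.4f}")
print(f"40,000 bits: {large:.4f}")
```

With the same number of items, the larger filter should report far fewer false positives. Neither filter ever reports a false negative.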

Wouldn't one simply use one specific browser, say either safari or firefox or chrome, and only that browser for their... unsavory activities? I think that is a great way to keep accounts separate and keep "bad" sites from knowing about "good" sites and vice versa. Just saying. Not that I partake in any such unsavory activities.

For testing purposes I use Chrome's "Users" feature to keep an extra profile with no extensions installed handy.

The same could be done for a "Porn" profile too, I guess, sandboxing any history, extensions and bookmarks to that profile. You could even tie it to a Google account for portability.

This problem is nullified by private browsing. I think the idea is BRILLIANT, as Google's already tracking all my 'legitimate' searches, and I find that most of what I Google are things I've looked at on other machines, or seen already.

The noise introduced by phrasing my query differently is a real problem in search that Google hasn't fixed yet.

Porn sites are not recorded

How does your system define "porn sites"? What about if it was some porn site no one has ever heard of with an innocent-sounding name/domain?

He apparently uses a list of 1.7 million sites. But you can also blacklist sites and have any existing entries for it removed:


Beat me to it! This was something I had been planning to build on my own for a while, but didn't get around to it. Congrats!

Whenever I have tech discussions with friends, I recall something mentioned in an article I read via HN. But it takes me a whole lot of effort to find that link again. Oftentimes I simply can't get hold of the link even after an hour of searching.

Please do get the Firefox extension out. Would love to use it. Also, please do make sure the extensions/addons are stable. Have been facing problems with Annotary's extensions [1], for instance.

By the way, do you have a crawler fetch the link content or do you send it from the user's browser?

[1] https://getsatisfaction.com/annotary/topics/unstable_browser...

Cofounder here. Our first version spidered out for the content, but a far more efficient way was to upload a compressed version of the data from the user, as we can then do hash checks for reference counting. The Chrome extension has been used in the wild for the last 3 months on 6 continents. The Firefox extension is too unstable at the moment (there is also Mozilla's ten-day review process), but we hope to get it out within 1-2 weeks. Would love any feedback, good or bad!
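One way "hash checks for reference counting" could work, as a hedged sketch: the server hashes each uploaded page body, stores identical bodies only once, and counts how many uploads point at each stored copy. The class name, the use of zlib and SHA-256, and the in-memory dictionaries below are assumptions for illustration, not a description of SeenBefore's actual backend.

```python
import hashlib
import zlib


class PageStore:
    """Stores each distinct page body once and counts references to it."""

    def __init__(self):
        self.blobs = {}      # content hash -> compressed body
        self.refcounts = {}  # content hash -> number of references

    def upload(self, compressed_body):
        # Hash the decompressed bytes so identical pages get identical keys.
        body = zlib.decompress(compressed_body)
        digest = hashlib.sha256(body).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = compressed_body
        self.refcounts[digest] = self.refcounts.get(digest, 0) + 1
        return digest

    def release(self, digest):
        # Drop one reference; delete the blob once nobody points at it.
        self.refcounts[digest] -= 1
        if self.refcounts[digest] == 0:
            del self.refcounts[digest]
            del self.blobs[digest]


store = PageStore()
article = zlib.compress(b"<html>An article two users both read</html>")

first = store.upload(article)   # user 1 uploads the page
second = store.upload(article)  # user 2 uploads the same content

print(first == second)          # True
print(len(store.blobs))         # 1
print(store.refcounts[first])   # 2
```

Because the key is derived from the content rather than the URL, two users who see different content at the same URL would end up with two separate entries.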

Was mainly concerned about the scalability. For a large number of users, your server would have to handle a large number of concurrent connections while they uploaded their data. If you used spiders, you could push the URLs to a queue and process them at your convenience.

How do you deal with 2 users looking at the same URL but seeing different things? example.com/me would be different for user1 and user2.

Some pages would be very dynamic, eg. Facebook. And not everyone browses facebook/twitter behind https (which you do not index). Do you not index social networks?

I like the fact that the extension requires no user input and works silently in the background. Has some trade-offs, but worth it. Cannot comment on the search quality yet because Chrome is only my secondary browser; not enough history to search for anything meaningful.

A few annoyances I noted in the FAQ's "What Google search sites does it support?" section. google.co.in is by default in English; you would have to explicitly set it to another language [1]. "Indian" is not a language (Hindi, Malayalam, Bengali etc. are [2]). Farsi is not spelt Farsai [3].

[1] https://www.google.co.in/?hl=ml

[2] http://en.wikipedia.org/wiki/Languages_of_India

[3] http://en.wikipedia.org/wiki/Persian_language

Fixed. Switched example to Turkish and Iranian, as that is where most of your traffic came from this post. Just read an 800 page book on India, can't believe I made that mistake.

It's, unfortunately, a common mistake.

You need to work on stemming and clustering terms, I think... I've visited a number of Postgres-related pages, and some of them contain only "Postgres" while others contain only "PostgreSQL", and searching for Postgres will only give me the former pages. It confused me for a little bit.
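A rough sketch of one way to handle this, assuming a hand-maintained alias table rather than a real stemmer (a typical stemming algorithm would not necessarily merge these two terms). The alias table, function names, and example pages are hypothetical. The key point is that the same normalization runs at both index time and query time.

```python
import re

# Hypothetical alias table mapping variant spellings to one canonical term.
ALIASES = {"postgresql": "postgres", "pgsql": "postgres"}


def normalize(text):
    # Lowercase, split into alphanumeric tokens, then apply the alias table.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [ALIASES.get(tok, tok) for tok in tokens]


def build_index(pages):
    # Inverted index: canonical term -> set of URLs containing it.
    index = {}
    for url, text in pages.items():
        for term in normalize(text):
            index.setdefault(term, set()).add(url)
    return index


def search(index, query):
    terms = normalize(query)
    if not terms:
        return set()
    # A page matches only if it contains every query term.
    return set.intersection(*(index.get(term, set()) for term in terms))


pages = {
    "http://example.com/a": "Tuning Postgres indexes",
    "http://example.com/b": "PostgreSQL 9.2 release notes",
}
index = build_index(pages)

print(sorted(search(index, "postgres")))
print(sorted(search(index, "PostgreSQL")))
```

Both queries return both pages, because "Postgres" and "PostgreSQL" collapse to the same canonical term before anything is stored or looked up.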

Interesting idea. Some quick questions:

- How much data do you store per user?

- How do I delete certain results? (preferably after the search comes back)

- Another thing to consider is - After how much time does this just become as painful as finding that page through a search engine?

- What version of the page gets stored? The latest or the one that I saw?

I guess it's one step better than Evernoting a page and adding tags myself.

Good luck!

The main issue with Evernoting and bookmarking is that it requires an effort to say that today, this page is useful and I want to store it. Most pages I want to find are the very things I did not think were useful at the time. Each unique page (unique as per the content) is stored per user. Our goal is to build the tools needed to find the information quickly, similar to what hipmunk.com did for airline search. We have the added dimension of time to use.

The main thing I use bookmarks for is categorisation. If you add the ability to tag and/or add notes that becomes part of the search terms, that'd be the killer feature for me - I could throw out my 3500 bookmarks and remove Xmarks (at least if we could get a way of automatically getting our existing bookmarks installed).

I'm a paying Xmarks user, but if you were to add a way of tagging sites or adding a note, I'd happily pay for this instead. Just a freeform text field that I could add some keywords into that gets treated as part of the search would actually be sufficient for me.

Some of the things I "see", I would prefer they never show up in a Google search.

I'm sure there is a configuration setting somewhere to deal with that, but it would be yet another thing to take care of.

One feature that could help this: verifying that the account holder is the one using the computer, before showing results.

Without this, assuming this plugin is always-on on all the computers one uses, breaking user's privacy just becomes too easy.

And there's a lot of data besides porn that one might want to keep private (and usually wouldn't post on Facebook): medical issues, sexual issues, marriage and other relationship issues, drug issues and probably others.

You could hook it into Chrome's history API[1].

[1] http://developer.chrome.com/extensions/history.html

If it would let people search their history and bookmarks, they could start benefiting from it right away.

Hmm, this is similar to http://historio.us, which I built. However, this doesn't require any user interaction, which might work well.

Do you store just the URL and depend on Google returning the results? How does it work exactly?

I remember thinking that http://historio.us was a neat idea.

But Seen Before requires less effort on my part as a user -> I am more likely to use it. I just continue to google as per normal and now I have an extra option on the right to filter results.

This is true. The use cases are a bit different, but I still don't know exactly how this works so I can't say.

"SeenBefore stores your information securely in the cloud from your work or home computers.

So no matter where you read it you can still search for it even when your browsing history has been deleted."


I think it's a good idea but I think many people would need convincing on the security front.

Who am I handing my data over to? I can't find this anywhere on the site.

Added a team page: https://www.seenbefore.com/pages/team . This had dropped off our list until launch. Sorry.

And would it be possible to configure it to use my own "cloud"?

Definitely something we are looking into. The major barrier is the cost of keeping a server running 24/7 in the cloud (a Micro instance on AWS is 175 dollars a year).

Some of us already have servers running 24-7 in the cloud. I have two, for example.

Lots of us here have our servers.... Personally, being able to point it at one of my own servers and/or getting an API, would be fantastic.

What about using your server to run the software, but flat files as storage, I could point it to dropbox for example?

Yes, I have seen it before! I built a personal search engine, MindRetrieve, back in 2005.


Specifically, I'm not comfortable with a big web company keeping the history of my web activity, so I made it work completely locally. My project did not get much uptake; my lackluster marketing and other assorted issues are probably to blame. So good luck on this one!

Too bad. Looks like a really cool project. Even more impressive when seeing how old it is.

Mac version? Planned six years ago?

Co-founder here. This took us by surprise, we were planning to have Firefox and Safari support done by launch. At this stage, it is priceless to know if we are solving a real problem people have. Also, is this something people would pay for (loops back to if this is enough of a pain point). From the moment we start charging, is the moment we start learning.

See comment elsewhere: With tagging or (simple plain text) notes attached, absolutely. Even more so with a simple API and/or support to push the cached content to my own server. If it could be selectively enabled for private content too, then even better (e.g. there are several extensive private wikis I use regularly that are not so sensitive that I'd worry about getting them indexed, and I'd love to be able to tell you to index them but perhaps disable the caching).

I don't spend much online, especially when it comes to recurring fees. However, I use Pinboard enough that it's going to be hard for me to resist paying the $25 fee they charge for archiving bookmarked pages for the second time.

Yeah I could code/hack together something myself and have been thinking of doing it [for fun], but ya know :p

So, yeah, count me in as being interested.

"40% of searches online are people simply looking for what they have already seen before." - How did they calculate this statistic?

"According to Yahoo, 40% of searches are simply searching for what you saw before."


They should link to the study.

Linked: https://www.seenbefore.com/pages/faq#currently_do . Thank you so much! :)

If I remember correctly, this was the result of some research done by a startup. Or was it Google? I couldn't tell you because this service didn't exist back when I read that article

I get that deployment is easier when it is vendor hosted, but this really should be a local app using local storage, with maybe transient server-side storage for syncing between machines.

Dup of Archify?


> 40% of searches online are people simply looking for what they have already seen before.

Citation link needed.

Citation link: http://cond.org/sigir07.pdf [PDF]

Information Re-Retrieval: Repeat Queries in Yahoo’s Logs

Abstract: "This paper explores repeat search behavior through the analysis of a one-year Web query log of 114 anonymous users and a separate controlled survey of an additional 119 volunteers. Our study demonstrates that as many as 40% of all queries are re-finding queries. Re-finding appears to be an important behavior for search engines to explicitly support, and we explore how this can be done."

Wow, do 240 people even count as a sample? At Yahoo and Google log sizes it's probably the error from cosmic rays in the data center.

If they selected them in a properly random way and had an effect close to 40% then yes, that probably does count as a sample.

As someone who signed up to coursera stats 101, err... Why 40%?

I am making some assumptions here absolutely, but because 40% is a large effect you don't need as many samples to be confident.

The other way of looking at it is that maybe it's actually 35% or 45% but either way, that's still interesting, even with a rougher approximation of the actual "answer". If, for some reason, you needed to know if it was 40% or 40.01% because that mattered to you then you would absolutely be annoyed at the small sample size.

If the finding was 2% then we would care about the uncertainty of +/- 5% since the finding is dwarfed by the error rate. That's a smaller effect size so you would need more samples to separate reality from the noise.

I am, by the way, pulling all of these numbers out of my ass. Your stats 101 class will teach you the formulas to calculate the actual error bars at work here as well as the assumptions you need to make about the distribution of the data to use those formulas.
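For a rough feel of the error bars, here is the textbook normal-approximation margin of error for a proportion, z * sqrt(p(1 - p) / n), as a Python sketch. It assumes each of the 233 participants (114 + 119 from the abstract) is an independent, randomly chosen observation, which is a simplification: the study's actual unit of analysis and sampling method may differ, and the approximation is weak for proportions near 0.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


n = 114 + 119  # participants mentioned in the abstract

for p in (0.40, 0.02):
    moe = margin_of_error(p, n)
    print(f"p={p:.2f}: +/- {moe * 100:.1f} percentage points "
          f"({moe / p:.0%} of the estimate)")
```

Under those assumptions, a 40% finding comes out at about +/- 6 percentage points, which is a modest fraction of the estimate. A 2% finding would carry an uncertainty nearly as large as the finding itself.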

I think this is a great idea. I've been using Opera, which has a full-text search capability for history, but it's limited to the machine you're using it on.

I often find interesting articles on Hacker News while I'm at home that I want to find again when I'm at work. Being able to search by browser history across machines is fantastic for me.

YES! I've been looking for something like this for ages for stuff I have read on Hacker News.

I use a system adapted from http://www.gwern.net/Archiving%20URLs to archive every page I've bookmarked (using FF) in the previous month. Then I just query with local tools.

Not ideal, several flaws, but works well enough for me so far.

I'll take it for a spin... this is something I've wanted for a long time.

I was going to hack it by making chrome bookmark every site I visit with a tag:history then when I wanted to search for a site that I've already visited I was going to just search with that tag.

Doesn't Google already have this? Go to...

Show Search Tools -> All Results -> Visited Pages

I'm pretty sure that's only for filtering pages visited via a prior google search

In theory Chrome lets you search through your history for pages, but it doesn't seem to actually work very well for me.

Similar to my project Peerbelt.com. A notable difference is that Peerbelt runs entirely on the client to avoid privacy concerns. Vinny, let's chat and see if we can collaborate. Cheers, -Krassimir, the Peerbelt founder

There's also weekly reports that tell you what sites you've been visiting the most, what time of day/what days you visit sites most, and how many pages SeenBefore added to your file.

In a similar vein, Pinboard offers to snapshot and full text index all your bookmarks, for a small annual fee:


I'm going to give it a spin and let you know what I think (it'll take a few weeks of usage), but I can tell you right now that it's definitely solving a real problem I have.

I am going to try this out because it seems like what I spend a large portion of my time doing. The security and privacy of this scares me a lot though.

I love this idea but I think you will find more traction by turning it into a kind of bookmarking app with less focus on the search engine part.

I love the date visualization. This is something I think that pretty much all search results could benefit tremendously from.

How is this different from Google Search History?


GSH searches only within your Google search history. This guy searches through your entire browser(s) history.

Thanks, I was confused by the fact that this integrates with Google Search.

Great idea, I assume with Chrome's new incognito browsing, this won't be picked up on seenbefore or am I wrong?

Looks useful. I've been well aware that google tracks everything I search for but I still don't like it.

I don't know why I should give you my browser history. I'd like to keep it to myself.

I agree, sounds like a crazy thing to do when this could easily be achieved locally on my machine. Or am I missing something ?

But if a small piece of software was installed on your machine it wouldn't be "in the cloud". We know that makes everything better. Ok well not application performance ... or cost ... or usability ... but still, "the cloud".

For those using Firefox, there is a similar add-on called RecallMonkey.


Great idea. Love it.

Great idea, and it works brilliantly!
