
SeenBefore: A search engine for what you have seen before - chrishan
https://www.seenbefore.com/
======
crntaylor
Obvious point to raise: the reason people regularly delete their browser
history is because they watch porn without turning on private browsing. How do
you propose to deal with this?

You'd need to provide at least the ability to selectively delete portions of
the history. But you can selectively delete portions of your browser history
too, and people don't - because it would be too easy to miss something.
Instead, they just nuke the whole thing. How is your tool different?

~~~
vinnyglennon
Vinny Glennon, one of the founders here. Thanks very much for the upvotes.
The Chrome extension does not work in private browsing. I have a set of porn
sites (1.7 million, stored in Redis) that I check incoming links against. You
can also selectively block sites
(<https://www.seenbefore.com/blacklist_items>).
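
A minimal sketch of the kind of membership check described here, assuming the blocklist is keyed by hostname. The names `BLOCKED_HOSTS` and `is_blocked` and the example domains are hypothetical, and a plain Python set stands in for the Redis set (where the equivalent lookup would be the `SISMEMBER` command). How SeenBefore actually matches links is not stated in the thread.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist. In the setup described above this would be a Redis
# set with about 1.7 million entries, not an in-process Python set.
BLOCKED_HOSTS = frozenset({"blocked.example", "adult.example"})


def is_blocked(url: str, blocked_hosts: frozenset[str] = BLOCKED_HOSTS) -> bool:
    """Return True if the URL's host, or any parent domain of it, is blocklisted."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # For "a.b.example" this checks "a.b.example", then "b.example", then "example".
    return any(".".join(parts[i:]) in blocked_hosts for i in range(len(parts)))
```

Checking parent domains as well as the exact host is a design choice made for this sketch, so that `www.blocked.example` is caught by an entry for `blocked.example`.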

~~~
andrewthornton
Where did you get your list from, and can you share it?

~~~
user24
For research purposes.

~~~
jparishy
For science.

------
gingerjoos
Beat me to it! This was something I had been planning to build on my own for a
while, but didn't get around to it. Congrats!

Whenever I have tech discussions with friends I would recall something
mentioned in an article I read via HN. But it would take me a whole lot of
effort to get that link. Oftentimes I simply couldn't get hold of the link
even after an hour of searching.

Please do get the Firefox extension out. Would love to use it. Also, please do
make sure the extensions/addons are stable. Have been facing problems with
Annotary's extensions [1], for instance.

By the way, do you have a crawler fetch the link content or do you send it
from the user's browser?

[1] <https://getsatisfaction.com/annotary/topics/unstable_browser_extensions>

~~~
vinnyglennon
Cofounder here. Our first version spidered out for the content, but a far more
efficient way was to upload a compressed version of the data from the user, as
we can then do hash checks for reference counting. The Chrome extension has
been used in the wild for the last 3 months on 6 continents. The Firefox
extension is too unstable at the moment (there is also Mozilla's ten-day review
process), but we hope to get it out within 1-2 weeks. Would love any feedback,
good or bad!
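
One way to read "hash checks for reference counting" is a content-addressed store: each distinct page body is kept once, keyed by its hash, with a count of how many visits point at it. The sketch below is an illustration under that assumption only. `PageStore`, SHA-256, and zlib are choices made here, not details given in the thread.

```python
import hashlib
import zlib


class PageStore:
    """Toy content-addressed store: one compressed copy per distinct page body."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}    # content hash -> compressed body
        self.refcounts: dict[str, int] = {}  # content hash -> number of references

    def add(self, body: str) -> str:
        """Store a page body and return its hash; duplicates only bump the count."""
        raw = body.encode("utf-8")
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = zlib.compress(raw)
        self.refcounts[digest] = self.refcounts.get(digest, 0) + 1
        return digest

    def release(self, digest: str) -> None:
        """Drop one reference; delete the blob once nothing points at it."""
        self.refcounts[digest] -= 1
        if self.refcounts[digest] == 0:
            del self.refcounts[digest]
            del self.blobs[digest]

    def get(self, digest: str) -> str:
        """Return the original page body for a stored hash."""
        return zlib.decompress(self.blobs[digest]).decode("utf-8")
```

With this layout, two uploads of an identical page cost one stored blob, and the server can compare hashes before deciding whether the body needs to be stored again.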

~~~
gingerjoos
A few annoyances I noted in the FAQ's "What Google search sites does it
support?" section: google.co.in is in English by default; you would have to
explicitly set it to another language [1]. "Indian" is not a language (Hindi,
Malayalam, Bengali etc. are [2]). Farsi is not spelt "Farsai" [3].

[1] <https://www.google.co.in/?hl=ml>

[2] <http://en.wikipedia.org/wiki/Languages_of_India>

[3] <http://en.wikipedia.org/wiki/Persian_language>

~~~
vinnyglennon
Fixed. Switched example to Turkish and Iranian, as that is where most of your
traffic came from this post. Just read an 800 page book on India, can't
believe I made that mistake.

~~~
gingerjoos
It's, unfortunately, a common mistake.

------
ankimal
Interesting idea. Some quick questions:

\- How much data do you store per user?

\- How do I delete certain results? (preferably after the search comes back)

\- Another thing to consider is - After how much time does this just become as
painful as finding that page through a search engine?

\- What version of the page gets stored? The _latest_ or the one that I saw?

I guess it's one step better than Evernoting a page and adding tags myself.

Good luck!

~~~
vinnyglennon
The main issue with Evernoting and bookmarking is that it requires an effort to
say that today, this page is useful and I want to store it. Most pages I want
to find are the very things I did not think were useful at the time. Each
unique page (unique as per the content) is stored per user. Our goal is to
build the tools needed to find the information quickly, similar to what
hipmunk.com did for airline search. We have the added dimension of time to use.

~~~
vidarh
The main thing I use bookmarks for is categorisation. If you add the ability
to tag and/or add notes that become part of the search terms, that'd be the
killer feature for me - I could throw out my 3500 bookmarks and remove Xmarks
(at least if we could get a way of automatically getting our existing
bookmarks installed).

I'm a paying Xmarks user, but if you were to add a way of tagging sites or
adding a note, I'd happily pay for this instead. Just a freeform text field
that I could add some keywords into that gets treated as part of the search
would actually be sufficient for me.
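
A rough sketch of what "a freeform text field that gets treated as part of the search" could look like: the note is tokenized the same way as the page text and added to the same index. `NotedIndex`, `tokenize`, and the example URLs used with it are hypothetical. This is not how SeenBefore or Xmarks is known to work.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class NotedIndex:
    """Toy inverted index where page text and user notes share one vocabulary."""

    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = defaultdict(set)  # token -> URLs

    def add_page(self, url: str, page_text: str, note: str = "") -> None:
        # The note is tokenized exactly like the page, so its keywords are searchable.
        for token in tokenize(page_text) | tokenize(note):
            self.postings[token].add(url)

    def search(self, query: str) -> set[str]:
        """Return URLs that match every query token (AND semantics)."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        results = [self.postings.get(token, set()) for token in tokens]
        return set.intersection(*results)
```

A page saved with the note "caching todo" would then come back for the query "todo" even if that word never appears on the page itself.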

------
bambax
Some of the things I "see", I would prefer they never show up in a Google
search.

I'm sure there is a configuration setting somewhere to deal with that, but it
would be yet another thing to take care of.

------
ippisl
One feature that could help this: verifying that the account holder is the one
using the computer, before showing results.

Without this, assuming this plugin is always-on on all the computers one uses,
breaking a user's privacy just becomes too easy.

And there's a lot of data besides porn that one might want to keep private
(and usually wouldn't post on Facebook): medical issues, sexual issues,
marriage and other relationship issues, drug issues and probably others.

------
beaumartinez
You could hook it into Chrome's history API[1].

[1] <http://developer.chrome.com/extensions/history.html>

~~~
arikrak
If it would let people search their history and bookmarks, they could start
benefiting from it right away.

------
StavrosK
Hmm, this is similar to <http://historio.us>, which I built. However, this
doesn't require any user interaction, which might work well.

Do you store just the URL and depend on Google returning the results? How does
it work exactly?

~~~
andy_boot
I remember thinking that <http://historio.us> was a neat idea.

But Seen Before requires less effort on my part as a user -> I am more likely
to use it. I just continue to google as per normal and now I have an extra
option on the right to filter results.

~~~
StavrosK
This is true. The use cases are a bit different, but I still don't know
exactly how this works so I can't say.

------
adaml_623
"SeenBefore stores your information securely in the cloud from your work or
home computers.

So no matter where you read it you can still search for it even when your
browsing history has been deleted."

Erm...

I think it's a good idea but I think many people would need convincing on the
security front.

------
nuttendorfer
Who am I handing my data over to? I can't find this anywhere on the site.

~~~
chuppo
And would it be possible to configure it to use my own "cloud"?

~~~
vinnyglennon
Definitely something we are looking into. The major barrier is the cost of
someone keeping a server running 24/7 in the cloud (a Micro instance on AWS is
175 dollars a year).

~~~
teach
Some of us already have servers running 24-7 in the cloud. I have two, for
example.

------
tungwaiyip
Yes, I have seen it before! I built a personal search engine, MindRetrieve,
back in 2005.

<http://mindretrieve.net>

Specifically, I'm not comfortable with a big web company keeping the history of
my web activity, so I made it work completely locally. My project did not get
much uptake; my lackluster marketing and other assorted issues are probably to
blame. So good luck on this one!

~~~
skinnymuch
Too bad. Looks like a really cool project. Even more impressive when seeing
how old it is.

------
vinnyglennon
Co-founder here. This took us by surprise; we were planning to have Firefox
and Safari support done by launch. At this stage, it is priceless to know if
we are solving a real problem people have. Also, is this something people
would pay for (which loops back to whether this is enough of a pain point)?
The moment we start charging is the moment we start learning.

~~~
vidarh
See comment elsewhere: With tagging or (simple plain text) notes attached,
absolutely. Even more so with a simple API and/or support to push the cached
content to my own server. If it could be selectively enabled for private
content too, then even better (e.g. there are several extensive private wikis
I use regularly that are not sensitive enough that I'd worry about getting
them indexed, and I'd love to be able to tell you to index them but perhaps
disable the caching).

------
ilija139
"40% of searches online are people simply looking for what they have already
seen before." - How did they calculate this statistic?

~~~
felipeko
"According to Yahoo, 40% of searches are simply searching for what you saw
before."

<https://www.seenbefore.com/pages/faq#currently_do>

They should link to the study.

~~~
vinnyglennon
Linked: <https://www.seenbefore.com/pages/faq#currently_do> . Thank you so
much! :)

------
akldfgj
I get that deployment is easier when it is vendor hosted, but this really
should be a local app using local storage, with maybe transient server-side
storage for syncing between machines.

------
vdm
Dup of Archify?

<https://www.archify.com/>

> 40% of searches online are people simply looking for what they have already
> seen before.

Citation link needed.

~~~
vinnyglennon
Citation link: <http://cond.org/sigir07.pdf> [PDF]

Information Re-Retrieval: Repeat Queries in Yahoo’s Logs

Abstract: "This paper explores repeat search behavior through the analysis of
a one-year Web query log of 114 anonymous users and a separate controlled
survey of an additional 119 volunteers. Our study demonstrates that as many as
40% of all queries are re-finding queries. Re-finding appears to be an
important behavior for search engines to explicitly support, and we explore
how this can be done."

~~~
lifeisstillgood
Wow, does 240 people even count as a sample? At Yahoo and Google log sizes it's
probably the error from cosmic rays in the data center.

~~~
freshhawk
If they selected them in a properly random way and had an effect close to 40%
then yes, that probably does count as a sample.

~~~
lifeisstillgood
As someone who signed up to coursera stats 101, err... Why 40%?

~~~
freshhawk
I am making some assumptions here absolutely, but because 40% is a large
effect you don't need as many samples to be confident.

The other way of looking at it is that maybe it's actually 35% or 45% but
either way, that's still interesting, even with a rougher approximation of the
actual "answer". If, for some reason, you needed to know if it was 40% or
40.01% because that mattered to you then you _would_ absolutely be annoyed at
the small sample size.

If the finding was 2% then we _would_ care about the uncertainty of +/- 5%
since the finding is dwarfed by the error rate. That's a smaller effect size
so you would need more samples to separate reality from the noise.

I am, by the way, pulling all of these numbers out my ass. Your stats 101
class will teach you the formulas to calculate the actual error bars at work
here as well as the assumptions you need to make about the distribution of the
data to use those formulas.
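
For a rough feel of those error bars, here is the textbook normal-approximation margin of error for a proportion. It assumes a simple random sample of `n` independent observations, which is a simplification: the study counts queries from 114 logged users plus 119 survey volunteers, so treating `n` as the number of people is only a stand-in. The function name `margin_of_error` is made up for this sketch.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion.

    p is the observed proportion, n the sample size, z the critical value
    (1.96 corresponds to roughly 95% confidence).
    """
    return z * math.sqrt(p * (1.0 - p) / n)


for n in (114, 233, 10_000):
    moe = margin_of_error(0.40, n)
    print(f"n={n:>6}: 40% +/- {moe * 100:.1f} points")
```

Under these assumptions the interval is about 9 points wide on each side at n = 114 and about 6 points at n = 233, so "somewhere between roughly a third and a half" is a fair reading of the 40% figure. The same formula at p = 0.02 and n = 233 gives about 1.8 points, nearly as large as the estimate itself, which is the sense in which a small effect needs more samples.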

------
Osiris
I think this is a great idea. I've been using Opera, which has a full-text
search capability for history, but it's limited to the machine you're using it
on.

I often find interesting articles on Hacker News while I'm at home that I want
to find again when I'm at work. Being able to search my browser history across
machines is fantastic for me.

------
lucaspiller
YES! I've been looking for something like this for ages for stuff I have read
on Hacker News.

~~~
pbhjpbhj
I use a system adapted from <http://www.gwern.net/Archiving%20URLs> to archive
every page I've bookmarked (using FF) in the previous month. Then I just query
with local tools.

Not ideal, several flaws, but works well enough for me so far.

------
elviejo
I'll take it for a spin... this is something I've wanted for a long time.

I was going to hack it by making chrome bookmark every site I visit with a
tag:history then when I wanted to search for a site that I've already visited
I was going to just search with that tag.

------
espeed
Doesn't Google already have this? Go to...

Show Search Tools -> All Results -> Visited Pages

~~~
eli
I'm pretty sure that's only for filtering pages visited _via a prior google
search_

In theory Chrome lets you search through your history for pages, but it
doesn't seem to actually work very well for me.

------
krassif
Similar to my project Peerbelt.com. A notable difference is Peerbelt runs
entirely on the client to avoid privacy concerns. Vinny, let's chat and see if
we can collaborate. Cheers, -Krassimir the Peerbelt founder

------
bbrian
There's also weekly reports that tell you what sites you've been visiting the
most, what time of day/what days you visit sites most, and how many pages
SeenBefore added to your file.

------
lesterbuck
In a similar vein, Pinboard offers to snapshot and full text index all your
bookmarks, for a small annual fee:

<https://pinboard.in/upgrade/>

------
martythemaniak
I'm going to give it a spin and let you know what I think (it'll take a few weeks
of usage), but I can tell you right now that it's definitely solving a real
problem I have.

------
cnlwsu
I am going to try this out because it seems like what I spend a large portion
of my time doing. The security and privacy of this scares me a lot though.

------
ThomPete
I love this idea but I think you will find more traction by turning it into a
kind of bookmarking app with less focus on the search engine part.

------
pcl
I love the date visualization. This is something I think that pretty much all
search results could benefit tremendously from.

------
adambyrtek
How is this different from Google Search History?

<https://history.google.com>

~~~
gingerjoos
GSH searches only within your Google search history. This guy searches through
your entire browser(s) history.

~~~
adambyrtek
Thanks, I was confused by the fact that this integrates with Google Search.

------
willegan
Great idea, I assume with Chrome's new incognito browsing, this won't be
picked up on seenbefore or am I wrong?

------
coenhyde
Looks useful. I've been well aware that google tracks everything I search for
but I still don't like it.

------
webwanderings
I don't know why I should give you my browser history. I'd like to keep it to
myself.

~~~
pilooch
I agree, sounds like a crazy thing to do when this could easily be achieved
locally on my machine. Or am I missing something?

~~~
freshhawk
But if a small piece of software was installed on your machine it wouldn't be
"in the cloud". We know that makes everything better. Ok well not application
performance ... or cost ... or usability ... but still, "the cloud".

------
mburns
For those using Firefox, there is a similar add-on called RecallMonkey.

<https://addons.mozilla.org/en-us/firefox/addon/prospector-recall-monkey/>

------
alanorourke
Great idea. Love it.

------
patriciaorgan
Great idea, and it works brilliantly!

