
Fetching.io: Search the full text of every web page you visit - tekacs
http://fetching.io/
======
flippyhead
Author here. Thanks so much for the interest! It's funny, I've been hacking
away at a major update that I plan to release in the next few days... Was
assuming I'd stay under the radar until then. The update adds a ton of
features and improves the UI quite a bit ;)

To respond to some of your ideas and feedback:

* I do plan to sell the service by some combination of charging to buy the native version and a monthly/yearly fee for the cloud version.

* By offering the native version I hope to assuage any privacy or legacy concerns -- all your data is on your machine (encrypted and backed up however you see fit). You'll even have access to a local API to extract or do whatever you want with it.

* One idea I've had is to offer a cloud version / native version combo. You would sync to the cloud only your bookmarked sites -- all the other indexed pages you visit would stay on the local version. This way you control what gets put up on the servers but can still have access to your links from all your devices. Thoughts?

* I'd also consider open sourcing it (it's built on Meteor and ElasticSearch) but really do need to get paid for my efforts (just had a baby) and am not familiar with all the ins/outs of open source based businesses. I'd love to hear ideas and advice!

* This has turned out to be quite a lot more difficult than I'd thought but I'm real happy with how things are coming along. Two words: ElasticSearch Rocks.

* Very embarrassed about the privacy policy link. Fixed now. ;)

~~~
teabee89
I'm impressed with your work, I always wanted to have something like this, and
was about to start coding it!

Here's my feedback: I do want the native version for privacy concerns, but I
also want the syncing. Why not offer a program (or Docker container?) that I
could put on _my_ cloud of choice? That would be the real freedom. If people
don't want to hassle with it, they will just pay for _your_ cloud offering.

I really value products that pay attention to this 'detail'.

~~~
MildlySerious
This is exactly what I wished Workflowy or Thinkery would let me do. I love
the mechanics of those two services, but I barely use them because I do not
have control over my data. Thinkery basically turned into a bookmarking
service for me because of that.

So yeah, that's what I would pay for in this case as well.

------
juretriglav
Good stuff, there's definitely a value in "local search", where local stands
for "your own stuff". Pinboard can be used in a similar way (the paid
version), but the difference is that's FTS only for pinned things.

Shameless plug, I'm attempting to do something similar specifically for
science, but make those local results also available globally using a
distributed network based on WebRTC. It's also a browser extension, which
detects if you're on a page of a scientific article. If you are, it takes the
body of the article and indexes it, by putting its contents into a DHT. You
can then use the extension to search through this distributed network. For
those interested, the post back from June is available here:
[http://juretriglav.si/an-open-distributed-search-engine-
for-...](http://juretriglav.si/an-open-distributed-search-engine-for-science/)
with the source code here:
[https://github.com/ScholarNinja/extension](https://github.com/ScholarNinja/extension)
The project will get a lot more love soon, as it turned out it was a bit too
early back then because WebRTC implementations were buggy (since fixed in
Chrome, but e.g. it resulted in 100% CPU usage in Chrome after a short while,
gigabytes of memory used).

Anyway, best of luck making Fetching.io sustainable, flippyhead!

------
natch
Nice.

I've been told (by slingbox folks some time ago) that EFF argues that
automatic updates are never (edit: generally not) a good idea. They can be
used to add or remove functionality by court order.

Another scenario where automatic updates of a native app hurt users is when
your company is purchased by a larger company who then shuts down the product.
Please reconsider that feature for the localhost version.

BTW I'm just going by the green checkbox in the features comparison table to
conclude that you have this ill-advised feature.

Sorry for latching on to that one thing, but it's important imho.

Other than that, this is something I've wished for many times, so great to see
it becoming real. I loved clamprecht's suggestion of backfill from history --
that would be great!

------
srean
It has not come up in the comments yet, so I thought I should mention historious
[http://historio.us/](http://historio.us/). It's by an HN old-timer.

[https://hn.algolia.com/?q=historious#!/story/forever/0/histo...](https://hn.algolia.com/?q=historious#!/story/forever/0/historious)

BTW I am not affiliated in any form.

~~~
StavrosK
Hey, that's mine! Thanks for the mention!

------
dantiberian
This is the kind of product that would really benefit from having a clear
business model up front. Free + some promise of charging in the future doesn't
convince me that it will be sustainable in its current form.

Without a clear alternative, the likely conclusion is that user data will be
used for advertising some time in the future.

------
avinashv
Very interesting. I can't get the extension to work on Safari, though it works
fine on Chrome. On Safari it logs in, but the search doesn't work (typing "f
<something>" just goes to Google to search for "f <something>" every time, and
when I restart the browser, I'm logged out.) Twitter authentication is also
busted (returns a 500 error).

When it works, it's fast, clean, and really well integrated into the workflow
of my browsing, since I use the address bar to control basically everything.

If you can figure out the Safari issue, I'd happily pay a few bucks a month
for the cloud version.

Quick edit: turns out the Safari extension is definitely indexing the
browsing, just the keyword search shows issues. Restarting the browser also
kills the authentication every time. Latest Safari on OS X 10.10, if it helps.

------
msandford
I looked at doing this a couple of years ago, but with a few differences:

1\. Only fetched things available publicly

2\. Was going to charge $5/mo

The rationale behind only fetching public things was to avoid indexing
people's banking records or other sensitive information.

The $5/mo was because I wasn't looking for venture funding and I wanted to get
paid.

Ultimately I gave up on the idea once it started to get difficult to
implement. Probably my biggest failing; I'm easily distracted.

I hope these folks do something to address privacy concerns and make their
business sustainable.

------
p4bl0
So this would save every page I visit, including when I'm logged in to private
services (email, bank, etc), to somewhere in the cloud I can't control?

Is there any client-side encryption done? If so, where is the publicly
auditable code? And how does the search work? Does it fetch everything and
decrypt it for every single query?

The idea is very good, but this should not be done in the cloud, it has to be
done locally, and potentially securely synchronized among different machines.

EDIT: Okay it's not mentioned on the landing page, but there is an option to
use it locally. Cool!

EDIT2: hey downvoters, when I read "Your cloud data is visible only to you.
You can optionally install fetching as an application on your computer." I
assumed that the app was a client for the cloud service that was distinct from
the web interface usable in a browser. This is a totally legit interpretation,
especially when the next title is "It's accessible from anywhere". I don't see
why it is wrong in that case to raise the privacy concerns that I mentioned. I
cared enough to continue investigating and found on another page that the
product can actually be used locally. I then edited my comment (maybe 4 or 5
minutes later) in accordance with that new knowledge. Knowing that the concerns
I raised are still valid for users who would choose to use it with the cloud,
what do your downvotes mean?

------
seanp2k2
This is super cool and I've been thinking about building something like this
for a long time. One thing to also consider: timing data. In my thinking about
it, I thought that it would be massively useful to record both the exact time
and order pages were visited in, and the other tabs open at that time. Lastly,
check out AlchemyAPI to auto-suggest some tags / keywords based on page
content to make recall easier.

Basically this + all of the above + a browsable timeline interface was my
idea. Please take it and build it if you have the time / motivation, and I'll
subscribe to your service (or help hack on it if you open-source it). The best
competitor I've found so far is Pocket with the premium features (I'm a
subscriber).

Good luck, and great work. I also love meteor and ES :)

~~~
baddox
Though it's not particularly practical, storing an event for every tab and
window open, close, and URL change could lead to some very cool analytics
after a long period of usage. I'd love to see visualizations of the number of
tabs I have open over time, number of tabs per domain, etc.
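
A minimal sketch of the event log described above, assuming a hypothetical
`TabEvent` record with "open", "navigate", and "close" kinds (none of these
names come from Fetching.io or any browser API). Replaying the events in time
order gives both the number of open tabs over time and the tabs per domain:

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class TabEvent:
    """Hypothetical event record; field names are illustrative only."""
    timestamp: float   # seconds since the epoch
    tab_id: int
    kind: str          # "open", "navigate", or "close"
    url: str = ""      # left empty for "close" events


def open_tabs_over_time(events):
    """Return (timestamp, open_tab_count) after each event, in time order."""
    open_tabs = set()
    series = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind == "close":
            open_tabs.discard(ev.tab_id)
        else:
            open_tabs.add(ev.tab_id)
        series.append((ev.timestamp, len(open_tabs)))
    return series


def tabs_per_domain(events):
    """Count the tabs still open per domain after replaying all events."""
    current = {}  # tab_id -> hostname of the page currently shown
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind == "close":
            current.pop(ev.tab_id, None)
        else:
            current[ev.tab_id] = urlsplit(ev.url).hostname or ""
    return Counter(current.values())
```

The series from `open_tabs_over_time` can be fed straight into any plotting
tool; storing raw events rather than aggregates keeps other analyses possible
later.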

------
idibidiart
This is the one I made a while ago that does the same thing, maybe a little
differently ...

[https://chrome.google.com/webstore/detail/all-seeing-
eye/kio...](https://chrome.google.com/webstore/detail/all-seeing-
eye/kiopjipnmfcpdambegpfmggaffjmhnkd)

~~~
pharshal
I just tried your extension on chrome. Works pretty neatly. I was wondering
where do you store the index for full text search so that I can keep it in my
dropbox and give it a 'cloud' like feature, where I can use the same dropbox
location from a different device.

------
userbinator
It seems like this functionality could be pretty easy to achieve (minus the
"cloud" part) with any existing browser that caches to the local filesystem -
just set the cache limits to "unlimited" so it'll continue accumulating pages,
and let your OS's search function take care of the rest. If you want to be
fancy and keep only the text, add a script that periodically runs to clear out
images, CSS, JS, and other cached files you don't need.
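
A rough sketch of such a cleanup script, assuming the cache has been exported
as ordinary files with recognizable extensions (real browser caches usually
use hashed names and their own container formats, so this is not a drop-in
tool). `KEEP_SUFFIXES`, `prune_cache`, and `html_to_text` are hypothetical
names chosen for illustration:

```python
from html.parser import HTMLParser
from pathlib import Path

# Assumption: only these file types are worth keeping for text search.
KEEP_SUFFIXES = {".html", ".htm", ".txt"}


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def prune_cache(cache_dir):
    """Delete every file whose suffix is not in KEEP_SUFFIXES.

    Returns the list of removed paths.
    """
    removed = []
    # Materialize the listing first so deletion does not disturb iteration.
    for path in list(Path(cache_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() not in KEEP_SUFFIXES:
            path.unlink()
            removed.append(path)
    return removed
```

Running `prune_cache` from a scheduled job and indexing the output of
`html_to_text` would leave the OS search with plain text only.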

One of the biggest problems I can see is with the increasing popularity of web
apps that load as a single page and use JS to load/parse/display the data;
only the browser can get the actual content in that case.

~~~
logn
A browser addon could work. I've toyed around in firefox. You can access the
rendered dom in addons and do whatever you want with it, including logging it
to file. Here's an example of writing to file:
[https://github.com/prekageo/http-request-
logger/blob/master/...](https://github.com/prekageo/http-request-
logger/blob/master/components/httpRequestLogger.js)

And you can observe all the requests/responses and wait for an http 200 for
the entire page (excluding intermediate 200's for things such as images).
Example:
[https://github.com/MachinePublishers/ScreenSlicer/blob/maste...](https://github.com/MachinePublishers/ScreenSlicer/blob/master/core/firefox-
addon/lib/main.js)

The best approach for this would probably be to tie into a page unload event
or some hybrid approach.

Getting started: [https://developer.mozilla.org/en-US/Add-
ons/SDK](https://developer.mozilla.org/en-US/Add-ons/SDK)

------
pbnjay
There are so many times when I've remembered a random sentence I read and
wanted to find the article to quote again later. This is perfect!

Like others here, I'm very curious about your business model, especially since
this is closed source.

~~~
pbnjay
um. I guess I retract my "closed source" comment and amend it to "not open
source". 234M .app seems pretty heavyweight - do you really need all these
node packages in the bundle?

If I'm running localhost, why do I need to create an account?

------
dools
I made a quick POC for something like this but with twitter:

[http://www.birdmine.com/](http://www.birdmine.com/)

Interestingly I created it initially to deal with the anxiety of not being
able to read all the great content available on the net.

Then the service broke (kind of) and I realised I didn't care that I couldn't
search all the stuff I tweeted.

So even though it's kind of broken it still solved my problem :)

I have been meaning to fix it up and get it operational.... One of these
days....

------
linguica
So this is sort of like the late Google Desktop? For a long time I have been
annoyed that with the oodles of space and bandwidth we have nowadays,
something as conceptually simple as "a searchable history of everything you
look at online" doesn't really exist anymore.

------
simi_
What a delightful and extremely useful piece of software! I had a plan to
build a smart bookmarking service (like pinboard on steroids), but I think
this is what I was actually thinking of. Great work!

PS: If you ever decide to open source it I'd be happy to contribute.

------
serf
I wish I could host my own cloud, but I understand the model. I just wish it
wasn't so :(.

~~~
justcommenting
it's getting easier every day; check out the free and open-source owncloud

------
mitchtbaum
Polipo can already store full text of every visited web page efficiently. For
searches, we'd need a way to grep its compressed cache. Web UI optional.
(Cache versioning would be a nice, separate feature.)
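
A minimal sketch of grepping a cache that may hold compressed entries. It
assumes each entry is a standalone file that is either plain text or a gzip
stream; it does not parse Polipo's actual on-disk format, which the thread
does not describe. `grep_cache` and `read_maybe_gzip` are hypothetical names:

```python
import gzip
from pathlib import Path


def read_maybe_gzip(path):
    """Return a file's contents as text, decompressing it if it is gzip."""
    raw = Path(path).read_bytes()
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    return raw.decode("utf-8", errors="replace")


def grep_cache(cache_dir, needle):
    """Yield (path, line_number, line) for each case-insensitive match."""
    needle = needle.lower()
    for path in sorted(Path(cache_dir).rglob("*")):
        if not path.is_file():
            continue
        lines = read_maybe_gzip(path).splitlines()
        for number, line in enumerate(lines, 1):
            if needle in line.lower():
                yield path, number, line
```

A linear scan like this gets slow on a large cache; a real tool would
probably build an index instead.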

------
walterbell
Very cool. Will it work with mobile Safari, which doesn't support browser
extensions/plugins?

Could the index data be added to
[http://commoncrawl.org](http://commoncrawl.org)?

------
capisce
Tried to create a new account via Twitter and got an "Internal server error
[500]". The concept sounds useful though.

~~~
oneeyedpigeon
I also got an internal server error when creating a full (non-twitter, non-
facebook) account, but it _had_ actually registered me, and I could login with
my details.

------
mrmondo
Great idea, so pleased you added a host-it-yourself / non-cloud option!

Can't wait for firefox support, then I can start using it.

------
undef1ned
Reminds me of [https://www.archify.com/](https://www.archify.com/)

------
ConSeannery
"Privacy" link is broken. Is there client side encryption, or can fetching
decrypt the data?

------
baby
Why limit it to web pages? You could be the Google of local search.

------
shirman
Is it possible to run it on private VPS?

------
wudf
that's pretty cool. we made something similar at an HP IDOL hackathon but it
was focused on social

------
justcommenting
this looks awesome--i'm excited for the linux version

~~~
sheltgor
Wish a Windows version was incoming as well. Alas, I'll have to make do with
Googling various phrases whenever I want to try and find a webpage I recall
seeing and want to find again, but can't remember the name of.

~~~
justcommenting
for me, a privacy-respecting free and open-source product built to be
accessible via tor browser (i.e. usable without the browser plugin) that
accepts bitcoin donations could be a great alternative to similar
functionality in e.g. the paid version of pinboard. replicating privacy-
enhanced workflows from pinboard/evernote and the ilk can be frustrating at
times, and trying to host my own bookmarks as a hidden service seems like
overkill.

