Hacker News
Fetching.io: Search the full text of every web page you visit (fetching.io)
190 points by tekacs on Oct 31, 2014 | 52 comments



Author here. Thanks so much for the interest! It's funny, I've been hacking away at a major update that I plan to release in the next few days... Was assuming I'd stay under the radar until then. The update adds a ton of features and improves the UI quite a bit ;)

To respond to some of your ideas and feedback:

* I do plan to sell the service by some combination of charging to buy the native version and a monthly/yearly fee for the cloud version.

* By offering the native version I hope to assuage any privacy or legacy concerns -- all your data is on your machine (encrypted and backed up however you see fit). You'll even have access to a local API to extract or do whatever you want with it.

* One idea I've had is to offer a cloud version / native version combo. You would sync to the cloud only your bookmarked sites -- all the other indexed pages you visit would stay on the local version. This way you control what gets put up on the servers but can still have access to your links from all your devices. Thoughts?

* I'd also consider open sourcing it (it's built on Meteor and ElasticSearch) but really do need to get paid for my efforts (just had a baby) and am not familiar with all the ins/outs of open source based businesses. I'd love to hear ideas and advice!

* This has turned out to be quite a lot more difficult than I'd thought but I'm real happy with how things are coming along. Two words: ElasticSearch Rocks.

* Very embarrassed about the privacy policy link. Fixed now. ;)
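
The cloud/native combo idea above (sync only bookmarked pages, keep everything else local) could be sketched roughly as follows. This is only an illustration: the `Page` record and `split_for_sync` helper are hypothetical names, not part of Fetching.io.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Hypothetical record for one indexed page."""
    url: str
    text: str
    bookmarked: bool = False


def split_for_sync(pages):
    """Return (to_cloud, local_only): only bookmarked pages leave the machine."""
    to_cloud = [p for p in pages if p.bookmarked]
    local_only = [p for p in pages if not p.bookmarked]
    return to_cloud, local_only


pages = [
    Page("https://example.com/article", "an article worth keeping", bookmarked=True),
    Page("https://bank.example/statement", "private account details"),
]
to_cloud, local_only = split_for_sync(pages)
print([p.url for p in to_cloud])
```

The point of the design is that the sync decision is an explicit, user-controlled flag on each page rather than something inferred on the server.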


This looks great, I've wanted this for a long time. One idea: Maybe have an option to "backfill" based on your browser's history? Seems like a good way to give users instant gratification, and solves the problem of "I wish I'd started running this a year ago". In reality, I wish I had it for every page I've visited in the last 14 years.

Update: Another idea - maybe you could integrate with pinboard or delicious.com (if anyone still uses it) to backfill-index all links saved to those services. Maybe this could be a premium feature.
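
A rough sketch of what a history backfill might start from, assuming Chrome's `History` file: it is a SQLite database with a `urls` table whose `last_visit_time` counts microseconds since 1601-01-01 UTC. The demo below uses an in-memory stand-in table rather than a real profile, the helper names are made up for illustration, and a real backfill would still have to re-fetch each URL to get its text.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Chrome timestamps count microseconds from this epoch.
CHROME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def chrome_time_to_datetime(microseconds):
    return CHROME_EPOCH + timedelta(microseconds=microseconds)


def datetime_to_chrome_time(dt):
    return (dt - CHROME_EPOCH) // timedelta(microseconds=1)


def history_urls(conn, limit=100):
    """Return (url, title, last_visit) tuples, most recent first."""
    rows = conn.execute(
        "SELECT url, title, last_visit_time FROM urls "
        "ORDER BY last_visit_time DESC LIMIT ?",
        (limit,),
    )
    return [(url, title, chrome_time_to_datetime(ts)) for url, title, ts in rows]


# Stand-in for a copy of the real History file (assumption: in practice you
# would copy the file first, since the browser keeps it locked while running).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, title TEXT, last_visit_time INTEGER)")
older = datetime(2014, 10, 30, tzinfo=timezone.utc)
newer = datetime(2014, 10, 31, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO urls VALUES (?, ?, ?)",
    [
        ("https://example.com/old", "Old page", datetime_to_chrome_time(older)),
        ("https://example.com/new", "New page", datetime_to_chrome_time(newer)),
    ],
)
backfill = history_urls(conn)
print(backfill[0][0])
```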


I'm impressed with your work, I always wanted to have something like this, and was about to start coding it!

Here's my feedback: I do want the native version for privacy concerns, but I also want the syncing. Why not offer a program (or Docker container?) that I could put on my cloud of choice? That would be the real freedom. If people don't want to hassle with it, they will just pay for your cloud offering.

I really value products that pay attention to this 'detail'.


This is exactly what I wished Workflowy or Thinkery would let me do. I love the mechanics of those two services, but I barely use them because I do not have control over my data. Thinkery basically turned into a bookmarking service for me because of that.

So yeah, that's what I would pay for in this case as well.


+1 Docker or VM.


I'd love to support you in some way if you decide to open source this.

I'm otherwise a bit paranoid to let all the text of every homepage I visit be captured by a closed source plug in. Still - amazing job, and technically very, very impressive.


Hi, The local install option of this looks very interesting to me. I'd like to connect by email and give some feedback. Happy to buy as I'm looking for something that isn't online only.

I'm a regular user of Diigo, about 10K links, 500 different tags. Don't like it being cloud only.

My favourite feature is my ability to annotate a link (mostly highlighting text), so it effectively creates a chronological and topical feed of the exact sentences of what I want to remember from a link. It's kind of a self writing blog of what I read and experienced, complete with what stood out to me, and any notes I wanted to make.

I find I more remember a point or a sentence from a link than the link itself, and having a full text search of the words I remember highlighting and saving is incredibly powerful. I actually end up revisiting those links.

I have some experience with research and filing large databases of articles and images at a job in another life.

Look forward to chatting :)


Very cool, and something I think would be great to have. To add to the questions, have you thought about a version that uses or can interface with owncloud (or something similar)? I think the cross-device capability would be great, but I would personally be more inclined to use it if I could keep the data on servers I control.


Hi, thanks for this incredibly useful tool. I was actually kind of working on a similar Chrome plugin. Now I don't have to :)

Does it index existing bookmarks? It seems it's not doing that now. The reason I wanted to build this is that my bookmarks have grown too big, and I wanted a way to search them.

Please add this feature, and indexing of the history too if possible.

Cheers


+1 for this feature!


Quick bug report on the cloud extension in Chrome, probably the others as well. The email address entered at login is case sensitive, but the initial registration converts any email address entered to lower case. It took me a good while to figure out why I couldn't log in.
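
One common way to avoid this kind of mismatch is to run the same normalization at registration and at login. The sketch below is a guess at the shape of the fix, not the service's actual code: `normalize_email`, `register`, `login`, and the in-memory `users` dict are all hypothetical. (Strictly speaking the local part of an address may be case sensitive, but lowercasing the whole thing is what the registration step reportedly already does.)

```python
def normalize_email(raw):
    """Apply the same normalization everywhere an email is used as a key."""
    return raw.strip().lower()


# Hypothetical in-memory user store: normalized email -> password hash.
users = {}


def register(email, password_hash):
    users[normalize_email(email)] = password_hash


def login(email, password_hash):
    # Normalize here too, so "Alice@Example.com" finds the stored lowercase key.
    return users.get(normalize_email(email)) == password_hash


register("Alice@Example.com", "hash-of-password")
print(login("alice@example.com", "hash-of-password"))
print(login("ALICE@EXAMPLE.COM ", "hash-of-password"))
```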


Congrats!

This product is amazing. I've had something like this in mind for a long time and couldn't find a proper implementation.

I would be glad to pay a reasonable price for such a service.

Localhost/native version is a killer feature. Don't drop it! If you open source the code I'll be glad to contribute...


Congrats, this looks great and promising.

I am wondering how localhost works. When the Linux version comes out, will it support storing the database on a remote host? In other words, using my own virtual host as a server.


I would love the ability to put in as input my delicious bookmarks or Pocket and have the ability to search just those inputs.


Looks great so far :) Any plans to add a Firefox extension any time soon?


i've also wanted this for a loooooooooong time.

btw will it also have a way to export the data?

I'm ready to pay for this service

keep me updated for the linux version :)

bussiere AT gmail.com

Regards


Good stuff, there's definitely a value in "local search", where local stands for "your own stuff". Pinboard can be used in a similar way (the paid version), but the difference is that's FTS only for pinned things.

Shameless plug, I'm attempting to do something similar specifically for science, but make those local results also available globally using a distributed network based on WebRTC. It's also a browser extension, which detects if you're on a page of a scientific article. If you are, it takes the body of the article and indexes it, by putting its contents into a DHT. You can then use the extension to search through this distributed network. For those interested, the post back from June is available here: http://juretriglav.si/an-open-distributed-search-engine-for-... with the source code here: https://github.com/ScholarNinja/extension The project will get a lot more love soon, as it turned out it was a bit too early back then because WebRTC implementations were buggy (since fixed in Chrome, but e.g. it resulted in 100% CPU usage in Chrome after a short while, gigabytes of memory used).

Anyway, best of luck making Fetching.io sustainable, flippyhead!


Nice.

I've been told (by slingbox folks some time ago) that EFF argues that automatic updates are never (edit: generally not) a good idea. They can be used to add or remove functionality by court order.

Another scenario where automatic updates of a native app hurt users is when your company is purchased by a larger company who then shuts down the product. Please reconsider that feature for the localhost version.

BTW I'm just going by the green checkbox in the features comparison table to conclude that you have this ill-advised feature.

Sorry for latching on to that one thing, but it's important imho.

Other than that, this is something I've wished for many times, so great to see it becoming real. I loved clamprecht's suggestion of backfill from history -- that would be great!


It has not come up in the comments yet, so I thought I should mention historious: http://historio.us/ It's by an HN old-timer.

https://hn.algolia.com/?q=historious#!/story/forever/0/histo...

BTW I am not affiliated in any form.


Hey, that's mine! Thanks for the mention!


This is the kind of product that would really benefit from having a clear business model up front. Free + some promise of charging in the future doesn't encourage me that it will be sustainable in its current form.

Without a clear alternative, the likely conclusion is that user data will be used for advertising some time in the future.


Very interesting. I can't get the extension to work on Safari, though it works fine on Chrome. On Safari it logs in, but the search doesn't work (typing "f <something>" just goes to Google to search for "f <something>" every time, and when I restart the browser, I'm logged out.) Twitter authentication is also busted (returns a 500 error).

When it works, it's fast, clean, and really well integrated into the workflow of my browsing, since I use the address bar to control basically everything.

If you can figure out the Safari issue, I'd happily pay a few bucks a month for the cloud version.

Quick edit: turns out the Safari extension is definitely indexing the browsing, just the keyword search shows issues. Restarting the browser also kills the authentication every time. Latest Safari on OS X 10.10, if it helps.


I looked at doing this a couple of years ago but with a few differences:

1. Only fetched things available publicly

2. Was going to charge $5/mo

The rationale behind only fetching public things was to avoid indexing people's banking records or other sensitive information.

The $5/mo was because I wasn't looking for venture funding and I wanted to get paid.

Ultimately I gave up on the idea once it started to get difficult to implement. Probably my biggest failing; I'm easily distracted.

I hope these folks do something to address privacy concerns and make their business sustainable.


So this would save every page I visit, including when I'm logged in to private services (email, bank, etc), to somewhere in the cloud I can't control?

Is there any client-side encryption done? If so, where is the publicly auditable code? And how does the search work? Does it fetch everything and decrypt it for every single query?

The idea is very good, but this should not be done in the cloud, it has to be done locally, and potentially securely synchronized among different machines.

EDIT: Okay it's not mentioned on the landing page, but there is an option to use it locally. Cool!

EDIT2: hey downvoters, when I read "Your cloud data is visible only to you. You can optionally install fetching as an application on your computer." I assumed that the app was a client for the cloud service that was distinct from the web interface usable in a browser. This is a totally legit interpretation, especially when the next title is "It's accessible from anywhere". I don't see why it is wrong in that case to raise the privacy concerns that I mentioned. I cared enough to continue investigating and found on another page that the product can actually be used locally. I then edited my comment (maybe 4 or 5 minutes later) in accordance with that new knowledge. Knowing that the concerns I raised are still valid for users who would choose to use it with the cloud, what do your downvotes mean?


This is super cool and I've been thinking about building something like this for a long time. One thing to also consider: timing data. In my thinking about it, I thought that it would be massively useful to record both the exact time and order pages were visited in, and the other tabs open at that time. Lastly, check out AlchemyAPI to auto-suggest some tags / keywords based on page content to make recall easier.

Basically this + all of the above + a browsable timeline interface was my idea. Please take it and build it if you have the time / motivation, and I'll subscribe to your service (or help hack on it if you open-source it). The best competitor I've found so far is Pocket with the premium features (I'm a subscriber).

Good luck, and great work. I also love meteor and ES :)


Though it's not particularly practical, storing an event for every tab and window open, close, and URL change could lead to some very cool analytics after a long period of usage. I'd love to see visualizations of the number of tabs I have open over time, number of tabs per domain, etc.
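
For what it's worth, the analytics side of this is cheap once the events exist. A minimal sketch, assuming a made-up event format of `(timestamp, kind, tab_id, url)` tuples sorted by time, where `kind` is `"open"`, `"navigate"`, or `"close"`:

```python
from collections import Counter
from urllib.parse import urlparse


def open_tab_counts(events):
    """Replay events in order; return the (timestamp, open tab count) series
    and the final mapping of tab_id -> current URL."""
    open_tabs = {}
    series = []
    for ts, kind, tab_id, url in events:
        if kind == "close":
            open_tabs.pop(tab_id, None)
        else:  # "open" or "navigate" both set the tab's current URL
            open_tabs[tab_id] = url
        series.append((ts, len(open_tabs)))
    return series, open_tabs


def tabs_per_domain(open_tabs):
    """Count currently open tabs by host name."""
    return Counter(urlparse(url).netloc for url in open_tabs.values())


events = [
    (0, "open", 1, "https://news.ycombinator.com/"),
    (5, "open", 2, "https://example.com/a"),
    (9, "navigate", 2, "https://news.ycombinator.com/item"),
    (12, "close", 1, None),
]
series, still_open = open_tab_counts(events)
print(series)
print(tabs_per_domain(still_open))
```

The `series` list is what you would plot for "tabs open over time"; the storage cost is one small record per event.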


This is the one I made a while ago that does the same thing, maybe a little differently ...

https://chrome.google.com/webstore/detail/all-seeing-eye/kio...


I just tried your extension on Chrome. Works pretty neatly. I was wondering where you store the index for full text search, so that I can keep it in my Dropbox and give it a 'cloud'-like feature, where I can use the same Dropbox location from a different device.


It seems like this functionality could be pretty easy to achieve (minus the "cloud" part) with any existing browser that caches to the local filesystem - just set the cache limits to "unlimited" so it'll continue accumulating pages, and let your OS's search function take care of the rest. If you want to be fancy and keep only the text, add a script that periodically runs to clear out images, CSS, JS, and other cached files you don't need.

One of the biggest problems I can see is with the increasing popularity of web apps that load as a single page and use JS to load/parse/display the data; only the browser can get the actual content in that case.
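
The "keep only the text" step could look something like the sketch below, which uses only Python's standard `html.parser`. The `TextExtractor` class and `visible_text` helper are illustrative names. As noted above, this only sees static HTML; it gets nothing useful from pages that build their content with JS.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
text = visible_text(sample)
print(text)
```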


A browser addon could work. I've toyed around in firefox. You can access the rendered dom in addons and do whatever you want with it, including logging it to file. Here's an example of writing to file: https://github.com/prekageo/http-request-logger/blob/master/...

And you can observe all the requests/responses and wait for an http 200 for the entire page (excluding intermediate 200's for things such as images). Example: https://github.com/MachinePublishers/ScreenSlicer/blob/maste...

The best approach for this would probably be to tie into a page unload event or some hybrid approach.

Getting started: https://developer.mozilla.org/en-US/Add-ons/SDK


I (as have many others) considered building something like this in the past. What you say was exactly what put me off.

The difference in CPU time between downloading a page and rendering it (even virtually as with say PhantomJS) was sufficiently large that running in the browser (not centrally) seemed to be the only general purpose way and maintaining _browser_ extensions is... a pretty major job. I was looking into writing a daemon to externally monitor the browser and its cache (such as Chrome's Current Session file) when I left off.

Hopefully the use of server-side prerendering will catch on, be it through Node or other systems...


Safari on OS X already does this - you can do a Spotlight search for text on any page in your browser history or in your bookmarks and it'll show up


There are so many times when I've remembered a random sentence I read and wanted to find the article to quote again later. This is perfect!

Like others here, I'm very curious about your business model, especially since this is closed source.


um. I guess I retract my "closed source" comment and amend it to "not open source". 234M .app seems pretty heavyweight - do you really need all these node packages in the bundle?

If I'm running localhost, why do I need to create an account?


I made a quick POC for something like this but with twitter:

http://www.birdmine.com/

Interestingly I created it initially to deal with the anxiety of not being able to read all the great content available on the net.

Then the service broke (kind of) and I realised I didn't care that I couldn't search all the stuff I tweeted.

So even though it's kind of broken it still solved my problem :)

I have been meaning to fix it up and get it operational.... One of these days....


So this is sort of like the late Google Desktop? For a long time I have been annoyed that with the oodles of space and bandwidth we have nowadays, something as conceptually simple as "a searchable history of everything you look at online" doesn't really exist anymore.


What a delightful and extremely useful piece of software! I had a plan to build a smart bookmarking service (like pinboard on steroids), but I think this is what I was actually thinking of. Great work!

PS: If you ever decide to open source it I'd be happy to contribute.


I wish I could host my own cloud, but I understand the model. I just wish it wasn't so :(.


it's getting easier every day; check out the free and open-source owncloud


Polipo can already store full text of every visited web page efficiently. For searches, we'd need a way to grep its compressed cache. Web UI optional. (Cache versioning would be a nice, separate feature.)
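
To give an idea of what "grep its compressed cache" might involve: the sketch below searches a directory of gzip-compressed text files. It does not assume anything about Polipo's real on-disk format; the `*.gz` layout and the `grep_gzip_cache` helper are hypothetical stand-ins.

```python
import gzip
import tempfile
from pathlib import Path


def grep_gzip_cache(cache_dir, needle):
    """Case-insensitive search; returns (file name, line number, line) hits."""
    needle = needle.lower()
    hits = []
    for path in sorted(Path(cache_dir).rglob("*.gz")):
        # Decompress on the fly and tolerate undecodable bytes.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if needle in line.lower():
                    hits.append((path.name, lineno, line.strip()))
    return hits


# Demo against a throwaway directory standing in for the cache.
with tempfile.TemporaryDirectory() as tmp:
    with gzip.open(Path(tmp) / "page1.gz", "wt", encoding="utf-8") as fh:
        fh.write("Search the full text\nof every web page you visit\n")
    hits = grep_gzip_cache(tmp, "WEB PAGE")
print(hits)
```

A linear scan like this gets slow as the cache grows, which is presumably where a real index would come in.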


Very cool. Will it work with mobile Safari, which doesn't support browser extensions/plugins?

Could the index data be added to http://commoncrawl.org?


Tried to create a new account via Twitter and got an "Internal server error [500]". The concept sounds useful though.


I also got an internal server error when creating a full (non-twitter, non-facebook) account, but it had actually registered me, and I could login with my details.


Great idea, so pleased you added a host-it-yourself / non-cloud option!

Can't wait for firefox support, then I can start using it.



"Privacy" link is broken. Is there client side encryption, or can fetching decrypt the data?


Why limit it to web pages? You could be the Google of local search.


Is it possible to run it on private VPS?


that's pretty cool. we made something similar at an HP IDOL hackathon but it was focused on social


this looks awesome--i'm excited for the linux version


Wish a Windows version was incoming as well. Alas, I'll have to make do with Googling various phrases whenever I want to try and find a webpage I recall seeing and want to find again, but can't remember the name of.


for me, a privacy-respecting free and open-source product built to be accessible via tor browser (i.e. usable without the browser plugin) that accepts bitcoin donations could be a great alternative to similar functionality in e.g. the paid version of pinboard. replicating privacy-enhanced workflows from pinboard/evernote and the ilk can be frustrating at times, and trying to host my own bookmarks as a hidden service seems like overkill.



