
Web Clipper Browser Extension with Automatic Content Extraction, Now Open Source - laybak
https://github.com/jhlyeung/rumin-web-clipper
======
laybak
Hi HN, a few months ago I started building a suite of knowledge management
tools for my own needs. It's been a long iterative process of noticing
patterns (and inefficiencies) in my workflow, building simple tools to improve
it, and evolving the UX over time.

One of the tools I have been using daily is a web clipper that captures not
just the current page, but can automatically extract key information from it.
You can also do a quick lookup of your existing notes regardless of which web
page you are on.

Prior to this, I had been using the web clipper extensions from Evernote,
OneNote, and Notion, and all of them had something missing that would
significantly slow me down. Wanted to share what I have built to address this.
The code is integrated with the [Rumin](https://getrumin.com) backend (the
other tools I built), but you can easily swap out the API calls to point to
local storage or some other endpoint.

Check it out. Would love to hear feedback from the community :)

~~~
neovive
Great project! Rumin looks very interesting as well. I was a long-time
Evernote Web Clipper user, but switched to Notion a few months ago. I'm much
happier with Notion's web clipping workflow and table storage approach, but
it's not perfect.

~~~
laybak
thanks neovive! Yeah, Notion's web clipping and table storage approach is
quite elegant. It only gets a bit clumsy when we get to the "power user" (not
very common) use cases.

------
karlicoss
Good work! For extracting meta information -- a set of community-maintained
information scrapers (HTML, or intercepting Ajax) for different websites could
be cool. It's hard to maintain all the sites on your own (especially the ones
you don't use), and by sharing we could perhaps avoid redoing the same thing
twice.

Your extension, of course, Wildcard
(https://news.ycombinator.com/item?id=22439141), youtube-dl, and possibly
many others could benefit from it.

~~~
laybak
thanks! yeah, that's a very good point, and one of the main reasons why I'm
open sourcing this.

Community-maintained information scrapers/extractors are definitely a
direction I want to build towards, collaborating with any existing efforts.
Though the exact form will take some iterations (e.g. a marketplace for
scripts/"recipes", built-in scripts for common sites, allowing individual
users to save their own scrapers, etc.).
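One possible shape for such a recipe registry, sketched under heavy assumptions (the site pattern, CSS selectors, and field names below are all illustrative, not actual Rumin code): each recipe declares which pages it handles and how to pull structured fields out of the DOM, with generic `<meta>` tags as the fallback.

```javascript
// Hypothetical sketch of a community "recipe" registry for per-site
// metadata extraction. Selectors and fields are examples only.
const recipes = [
  {
    // Which pages this recipe applies to.
    matches: (url) => url.hostname === "www.youtube.com",
    // How to pull structured fields out of the page's DOM.
    extract: (doc) => ({
      title: doc.querySelector("h1")?.textContent?.trim(),
      channel: doc
        .querySelector('[itemprop="author"] [itemprop="name"]')
        ?.getAttribute("content"),
    }),
  },
  // ...community-contributed recipes for other sites would go here.
];

// Fall back to generic Open Graph tags when no recipe matches the page.
function extractMetadata(url, doc) {
  const recipe = recipes.find((r) => r.matches(url));
  if (recipe) return recipe.extract(doc);
  return {
    title: doc
      .querySelector('meta[property="og:title"]')
      ?.getAttribute("content"),
  };
}
```

Keeping each recipe as plain data (a match predicate plus an extractor) is what would make a marketplace or user-saved scrapers feasible: recipes can be shared, versioned, and loaded independently of the extension itself.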

------
mark_l_watson
Nice, this looks extremely useful.

Years ago, I spent a couple of months building a simple Evernote clone in
Clojure. The weakest part of my “for my own use only” project was a Firefox
extension I wrote to capture selected web page data and send it to the backend
of my system.

This Web Clipper project would have really helped me. I hope the author of
this gets the satisfaction of wide adoption in many cool projects.

~~~
laybak
author here, this comment already made my day :) feel free to take the code
and run with it, and let me know if you have any suggestions/questions

------
kapnobatairza
This looks great! I use Evernote Web Clipper but spend a lot of time adding
context/information/screenshots manually, this would save me a ton of time. I
requested access to Rumin and will definitely try swapping this into my
workflow.

~~~
laybak
Thanks, kapnobatairza! Working on cleaning up the product before opening it
up; will reach out to ask you more about your use cases :)

------
tyingq
Looks pretty slick, though custom metadata for just 7 sites seems pretty low
for launch. Perhaps the default metadata capture is good enough for sites like
Wikipedia, Amazon, etc., that aren't covered?

~~~
laybak
thanks for checking it out! yeah the coverage for the metadata capture
definitely needs to be improved. At this point, this just includes the top
sites for my own use cases (and some early users).

I was hoping that by sharing it I could get a better sense of what sites other
people would like to have supported, and keep adding to it :)

------
rektide
Built to assist a proprietary locked down thing, but heck yes

~~~
laybak
haha for now... the main reason being it's just me working on it at the
moment, and I'm fixing/cleaning things up before releasing more of the code
base. The rest of the product is pretty clunky (with a beyond-shitty code
base).

in the meantime, it should be easy to swap out the API hostname to something
else (or even local storage)

~~~
input_sh
Please pick a license. By not doing so, you retain full ownership of the code,
preventing other people from modifying it for their own needs. See here for
more details: https://choosealicense.com/no-permission/

~~~
laybak
ohh thanks for the heads up! will look into this. Sorry, this is my first time
sharing an open source project.

EDIT: I've added an MIT license to the code. Thanks again for pointing it out.

------
wila
I was a huge fan of the old "Scrapbook" plugin. That one died as Firefox
switched their APIs.

Looks like there's a new version, WebScrapBook [1], based on that old code
base, which is now available.

[1] https://addons.mozilla.org/en-US/firefox/addon/webscrapbook/

------
lucasverra
Hi there; congrats on supporting OSS and Rumin.

I'm an active Notion web clipper user. I trust Notion because they have plenty
of money to not lose my data.

Why should I use Rumin?

~~~
laybak
Thanks Lucas! I was a Notion web clipper user as well. For me it worked for
the most basic use case of saving a page into a table. But these use cases
kept coming up:

- An idea belongs to multiple collections, as opposed to a single "table"
- There are usually properties/metadata I want to save (e.g. YouTube channel
information), which would take multiple copy-and-pastes back and forth each
time
- Bi-directional linking of captured content
- I wanted full control over the captured data, for more advanced
queries/filtering

It's sad that web clippers tend to be one of these "table stakes" features
that companies build a basic version of, and not invest further in.

Quick answer for "Why should I use Rumin?" is: "Perhaps you shouldn't yet, but
let's stay in touch and I'd love to hear about your use cases and other
ideas."

The current version of Rumin is very rough, and there's an overwhelming list
of improvements to make. This is one reason why I closed it for sign ups for
now. But in the meantime, I feel there's a lot the community can do even with
just the web capture component being open source.

Regarding your concern about data loss: I intend to open source more and more
parts of the platform, and somehow figure out a model to make the development
sustainable.

Thanks again for checking it out!

------
Causality1
Ah, web clipping. I haven't heard that phrase used since PDAs were running on
the Mobitex network and had to use web clipping to usefully browse the
internet at all.

