
Bookmark Archives That Don't - stilist
http://pinboard.in/blog/153/
======
_grrr
I'd be interested to know how you fully resolve external dependencies. For
example - do you pull in js libraries that are linked to dynamically within
other js files (as opposed to those that are simply referenced statically as
includes in the html)? If so, are you rendering the page in a 'headless
browser' to do this?
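The static half of that question is the easy part: assets referenced directly in the markup can be found by parsing the HTML alone, while anything a script fetches later really does need a headless browser to discover. A minimal, illustrative sketch of the static side (not Pinboard's actual code):

```python
# Collect statically referenced assets (script/link/img) from raw HTML.
# Dynamically loaded dependencies won't appear here -- finding those
# requires executing the page's javascript in a real (headless) browser.
from html.parser import HTMLParser

class AssetCollector(HTMLParser):
    """Gather URLs referenced statically via script, link, and img tags."""
    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and "src" in attrs:
            self.assets.append(attrs["src"])
        elif tag == "link" and "href" in attrs:
            self.assets.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.assets.append(attrs["src"])

page = ('<html><head><script src="a.js"></script>'
        '<link rel="stylesheet" href="s.css"></head>'
        '<body><img src="logo.png"></body></html>')
collector = AssetCollector()
collector.feed(page)
# collector.assets is now ['a.js', 's.css', 'logo.png']
```

A js file loaded by `a.js` at runtime would be invisible to this parser, which is exactly the gap the question is about.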

By the way, I think the idea for the service is great, although a little too
pricey for me to start using yet ;-) I always need to search my bookmarks. As
a proxy for doing this, I currently use Google's "search visited pages"
feature: when you're logged in to Google and search, you now get the option to
constrain the search to only those pages that you have visited in the past - a
superset of bookmarks, but useful nonetheless.

~~~
riffraff
Connected to this: a lot of the content in pages is often pulled in via js. For
example, facebook's page as seen in the links text-mode browser is basically a
long list of script tags without any content.

Without javascript evaluation it seems that a lot of content would be lost.

~~~
baddox
Based on my usage of Firebug, I'm under the impression that even if content was
put onto the page with js (or if the html source is badly broken), the browser
will still build a valid DOM for the page and should be able to serialize it
back out as html/css.

------
hartbren
Surely the service is exposed to copyright claims. If the developers/business
owners are reading, I would be interested to hear about what issues have
arisen so far.

~~~
idlewords
None. Cached links are only visible to the user who saved them, and that seems
to do the trick.

------
_debug_
What am I missing: why not just use Firefox ScrapBook and automate regular
backups and syncing with your laptop (which is what I do right now)?

~~~
gwern
Or have a script which pulls URLs out of your Firefox history (an SQLite
database) and archives them using <http://webcitation.org/>? (What I am
currently doing.)
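The approach gwern describes is only a few lines: Firefox keeps history in places.sqlite (table moz_places), and each URL can be handed to an archiving service. A rough sketch — the schema below is a minimal stand-in for the real moz_places table, and the WebCite submission URL is an assumption, so check the service's documentation before relying on it:

```python
# Pull URLs from a Firefox-history-style SQLite database and build
# archiving requests for each. Illustrative, not gwern's actual script.
import os
import sqlite3
import tempfile
import urllib.parse

def history_urls(db_path):
    """Yield every URL stored in a places.sqlite-style history database."""
    with sqlite3.connect(db_path) as conn:
        for (url,) in conn.execute("SELECT url FROM moz_places"):
            yield url

def webcite_request(url, email):
    """Build an (assumed) WebCite on-demand archiving request URL."""
    return "http://www.webcitation.org/archive?" + urllib.parse.urlencode(
        {"url": url, "email": email})

# Demo against a throwaway database standing in for places.sqlite;
# the real table has many more columns than this.
db_path = os.path.join(tempfile.mkdtemp(), "demo_places.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE moz_places (url TEXT)")
conn.execute("INSERT INTO moz_places VALUES ('http://example.com/post')")
conn.commit()
conn.close()

requests = [webcite_request(u, "me@example.com")
            for u in history_urls(db_path)]
```

Actually submitting each request (e.g. with urllib) and rate-limiting are left out; the point is how little glue the approach needs.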

------
pclark
I bought this functionality but have honestly never found the need for it; it
kind of reminds me of the Dropbox "Pack Rat" feature.

Happy to support a great developer (in both cases, actually).

------
RexRollman
I've tried making PDFs of interesting webpages before, for personal archiving,
using various PDF printers for Windows. I always ended up with files that
looked weird and not at all like the webpage.

I eventually found a FF extension that would save pages perfectly as a JPG or
PNG but then the text was no longer selectable or searchable.

~~~
laktek
One should build a service that will save the page as an image while keeping
the textual content of the page separately for searching.
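The search half of that idea is straightforward: strip the visible text out of the HTML and index it alongside the path to the rendered image. A sketch of the text side (the screenshot itself would come from a headless browser or PDF printer, which is outside this snippet; names are illustrative, not an existing service):

```python
# Extract visible text from HTML so an image-based archive stays searchable.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Pull visible text out of HTML, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def index_entry(image_path, html):
    """Pair a rendered-page image with the page's searchable text."""
    parser = TextExtractor()
    parser.feed(html)
    return {"image": image_path, "text": " ".join(parser.chunks)}

entry = index_entry("archive/page-0001.png",
                    "<html><script>var x=1;</script><body><h1>Hello</h1>"
                    "<p>searchable words</p></body></html>")
# entry["text"] is "Hello searchable words" -- script contents are excluded.
```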

~~~
mixmax
Maybe it would even be possible to reproduce these webpages using HTML.

------
wladimir
I suppose a big part of the storage problem he talks about is solved by
aggressively looking for duplicate and similar files across users. I mean, it
is a given that a lot of people will be bookmarking the same sites.

~~~
mikeklaas
Yes, but they will also be bookmarking a lot of stuff that only they bookmark.

In our dataset of over a billion bookmarks, 80% of urls were only bookmarked
by a single user. These urls comprise about 50% of all bookmarks (user-
document pairs).
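Working those figures through shows what de-duplication does and doesn't buy. Taking an even 1 billion bookmarks for round numbers (the dataset is described only as "over a billion"), the arithmetic comes out as:

```python
# Back-of-the-envelope check on the figures above: 80% of distinct URLs
# are bookmarked by a single user, and those singletons account for 50%
# of all (user, document) pairs. Totals are illustrative round numbers.
bookmarks = 1_000_000_000
singleton_bookmarks = 0.50 * bookmarks        # one user each => one URL each
distinct_urls = singleton_bookmarks / 0.80    # singletons are 80% of URLs
shared_urls = 0.20 * distinct_urls
avg_users_per_shared_url = (bookmarks - singleton_bookmarks) / shared_urls

# De-duplicating page content means storing one copy per distinct URL
# instead of one copy per bookmark:
storage_ratio = distinct_urls / bookmarks
# distinct_urls ~ 625M, avg_users_per_shared_url ~ 4, storage_ratio ~ 0.625
```

So under these numbers dedup across users saves only about a third of the naive storage cost: the singleton long tail dominates.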

Incidentally, worio.com (my startup) offers full-text search of your bookmarks
(though not a viewable cached copy, like these services).

~~~
bruceboughton
I'd bet the percentage of external resources referenced by bookmarked pages
that were unique to a user would be a lot lower... which is where the
aggressive de-duplication would come in handy.
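One common way to get that de-duplication is content-addressed storage: blobs are keyed by a digest of their bytes, so the copy of jQuery referenced by a million bookmarked pages is stored exactly once. An illustrative sketch (not any particular service's storage layer):

```python
# Content-addressed blob store: identical bytes are stored exactly once,
# no matter how many users' archives reference them.
import hashlib

class BlobStore:
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store data under its SHA-256 digest; return the key."""
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = BlobStore()
a = store.put(b"/* jquery 1.4.2 */ ...")
b = store.put(b"/* jquery 1.4.2 */ ...")  # second user, same library
# a == b, and only one copy of the bytes is kept.
```

Each user's archive then just records digests, and shared resources like common js libraries cost the service a single copy.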

------
earl
I use Zotero to save full pages; I like it. It's software that runs locally in
FF.

If someone is interested, I'd love to talk about some improvements -- you
could start with:

1 - remove ads

2 - create a single-file website archive format like the one IE used (uses?).
It's stupid to litter your file system with js and images when all you want
is one file that includes the html and all dependencies

3 - deduplication w/ single save

4 - better search

5 - better compression

6 - site tracking
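On item 2: the single-file format IE used (.mht) is MHTML, which is just the page and its assets packed into one MIME multipart/related message, each part tagged with the URL it was fetched from. A minimal sketch with the stdlib email package (illustrative; a real archiver would also rewrite or resolve relative URLs):

```python
# Bundle a page plus its assets into one MHTML-style file using the
# stdlib email package. Each part carries a Content-Location header
# naming the URL it was fetched from, as in IE's .mht format.
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_mhtml(page_url: str, html: str, assets: dict) -> bytes:
    """Pack html plus {url: bytes} assets into a multipart/related blob."""
    outer = MIMEMultipart("related", type="text/html")
    page = MIMEText(html, "html", "utf-8")
    page.add_header("Content-Location", page_url)
    outer.attach(page)
    for url, data in assets.items():
        part = MIMEApplication(data)  # generic binary part
        part.add_header("Content-Location", url)
        outer.attach(part)
    return outer.as_bytes()

archive = build_mhtml("http://example.com/",
                      "<html><body><img src='logo.png'></body></html>",
                      {"http://example.com/logo.png": b"\x89PNG..."})
# archive is a single byte string: one file, html and all dependencies.
```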

