
Show HN: Tesoro – Personal internet archive - agamble
https://tesoro.io
======
JackC
For personal web archiving, I highly recommend
[http://webrecorder.io](http://webrecorder.io). The site lets you download
archives in standard WARC format and play them back in an offline (Electron)
player. It's also open source and has a quick local setup via Docker -
[https://github.com/webrecorder/webrecorder](https://github.com/webrecorder/webrecorder)
.

Webrecorder is by a former Internet Archive engineer, Ilya Kreymer, who now
captures online performance art for an art museum. What he's doing with
capture and playback of Javascript, web video, streaming content, etc. is
state of the art as far as I know.

(Disclaimer - I use bits of Webrecorder for my own archive, perma.cc.)

For OP, I would say consider building on and contributing back to Webrecorder
-- or alternatively figure out what Webrecorder is good at and make sure
you're good at something different. It's a crazy hard problem to do well and
it's great to have more ideas in the mix.

~~~
motdiem
Seconding Webrecorder (and the newly updated WAIL) - I had the chance to meet
Ilya Kreymer at a conference a few weeks ago, and I can confirm what he's doing
is top notch. I'm hoping to see more work around WARC viewing and sharing in
the future.

(Disclaimer: I also do personal archiving stuff with getkumbu)

~~~
amrrs
Is offline playback still relevant in the age of ubiquitous, always-connected
Internet?

~~~
kchr
If your intention is to have a local archive of an online site, yes.

------
smoyer
It's not mine unless it's running on my own servers or computer - I created a
really rough version of this several years ago that saves to my computer (and
from there into Box).

~~~
pbhjpbhj
I adapted a bash script someone posted here; it uses Firefox bookmarks
(places.sqlite). Cron runs the script and downloads every page I've bookmarked
that month (after some filtering). I don't use it often but sometimes I'll
awk-grep it; I'm a hoarder in real life too!

~~~
gkya
Would you mind posting it?

~~~
pbhjpbhj
See sibling comment.

------
Piskvorrr
That's just as much "my own" as The Internet Archive: a website Out There
somewhere. Worse, it's much more likely to rot and disappear than archive.org.
Now, if I could run this _locally_...

(Yes, yes, `wget --convert-links`, I know. Not quite as convenient, though.)

~~~
agamble
OP here. The Internet Archive is great, but it's not so awesome if there's
some ephemeral content you need to save right away, like Tweets or social
media posts. Being able to trigger an archive immediately lets you save
temporary content like that, which is more prone to deletion. I'm going to
build a Chrome extension that makes a cloud copy of the page you're on with
one click; hopefully that will make it feel more personally controllable.

Do you think being able to download the archive locally would be useful?

~~~
dsacco
This might sound insane, but if you modified this into a browser extension
that runs locally (with options for one-off or continuous saving for entire
browsing sessions) I would probably download it. I have well over 100TB of
hard drive space at home, and I would love to just
download entire portions of my browsing history locally for archival reasons
(and to truly defeat link rot).

As it is now, I personally wouldn't use it (but it's a cool project,
definitely please keep working on this idea!).

~~~
chongli
_modified this into a browser extension_

I was just thinking about this last night while I was explaining my use of the
Firefox tab groups extension to a friend. I use bookmarks and tabs to keep
track of information. Neither is fully convenient and the whole system fails
whenever a page changes or a link rots.

I would love a system that archives a page I bookmark so that the bookmark
will always work to give me that information. Give me an 'ephemeral' checkbox
if I want my bookmark to change when the site changes. Hmmm.

~~~
rahiel
I made a browser extension [1] that automatically archives bookmarks to
archive.is or (currently Chromium only) locally as MHTML files.

[1]:
[https://github.com/rahiel/archiveror](https://github.com/rahiel/archiveror)

~~~
pdfernhout
Cool, Rahiel! Thanks for doing this.

Here is a related idea I proposed a couple years ago to a Knight News
Challenge on Libraries:
[https://web.archive.org/web/20161104175911/https://www.newsc...](https://web.archive.org/web/20161104175911/https://www.newschallenge.org/challenge/how-
might-libraries-serve-21st-century-information-needs/submissions/libraries-as-
distributed-digital-knowledge-repositories) "Create a browser addon so when
people post to the web they can send a copy for storage and hosting by a
network of local libraries. ... While the Internet Archive is backing up some
of the internet, it is another single point of failure. We propose developing
data standards, software applications, coordination protocols, and hardware
specifications so every local library in the world can participate in backing
up part of the internet. ..."

It's sad that the Knight Foundation has changed their software and so all the
old Knight News Challenge contributions are no longer available. It's an
example of the very thing that contribution was about -- the need for
distributed backups. Glad that info is still findable in archive.org -- until
perhaps the Knight Foundation puts up a broad robots.txt and makes it all
inaccessible.

Thanks again for creating a great plugin!

~~~
WhiteOwlLion
What about IPFS for storing cached pages?

~~~
pdfernhout
Great idea -- thanks!
[https://en.wikipedia.org/wiki/InterPlanetary_File_System](https://en.wikipedia.org/wiki/InterPlanetary_File_System)

------
j_s
I would be interested in an attestation service that can provide court-
admissible evidence that a particular piece of content was publicly accessible
on the web at a particular point in time via a particular URL.

I believe the only way to incentivise participation in such a system is by
paying for timestamped signatures, e.g. "some subset of downloaded [content]
from [url] at [time] hashed to [hash]" all tucked into a Bitcoin transaction
or something. There are services that will do this with user-provided
content[1]; I am looking for something that will pull a URL and timestamp the
content.
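
To make the commitment step concrete, here is a minimal sketch in Go of the
hashing a validator might perform before anchoring the result on-chain. The
function name and URL are hypothetical, and the actual transaction embedding
(e.g. via OP_RETURN) is left out:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchAndCommit is a hypothetical helper: fetch a URL and produce the
// "[content] from [url] at [time]" digest a validator could sign or anchor
// in a Bitcoin transaction.
func fetchAndCommit(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}

	ts := time.Now().UTC().Format(time.RFC3339)
	h := sha256.New()
	h.Write(body)        // the content as this validator saw it
	h.Write([]byte(url)) // the URL it was fetched from
	h.Write([]byte(ts))  // the fetch time
	return fmt.Sprintf("%s sha256=%x", ts, h.Sum(nil)), nil
}

func main() {
	commitment, err := fetchAndCommit("https://example.com/")
	if err != nil {
		panic(err)
	}
	fmt.Println(commitment) // what each independent validator would attest to
}
```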

This would also be a way to detect when different users are being served
different content at the same url, thus the need for a global network of
validators.

[1] [https://proofofexistence.com/](https://proofofexistence.com/)

~~~
rjeli
Interesting - it is trivial to prove something was done today rather than
yesterday, by hashing with the most recent bitcoin block or some new info.

Is it possible to prove something was done in the past? All I can think of is
some sort of scheme involving destroyed information.

~~~
j_s
_trivial to prove something was done today_

My focus is on the _something_ much more so than the _when_. I can do my own
doctoring of any data, or use some service to make something that looks
real[1]. Getting some proof that this fake data existed is not what I'm after.

Instead, I want multiple, completely separate (and ideally as independent and
diverse as possible) attestations that _something_ was out there online, as
proof that some person or organization intended for it to be seen by everyone
as their content. Being able to prove that irrefutably seems nearly impossible
today even for the present time, particularly against insider threats.

Your question regarding proving something in the past is going far beyond what
I'm hoping for; it will take me quite a while to come up with anything that
might be helpful for such a situation. I assume most would hit up the various
archive sites, but my gut feeling is that it winds up being a probability
based on how well the forensics hold up / can't be falsified.

[1] simitator.com - not linking because ads felt a bit extra-sketch!

------
unicornporn
In what way could this be considered “your own internet archive”? I see no
way to register a user and save pages to a collection.

If you really want to create _your own_ archive, set up a Live Archiving HTTP
Proxy[1], run SquidMan [2] or check out WWWOFFLE[3].

If you want something simpler, have a look at Webrecorder[4] or a paid
Pinboard account with the “Bookmark Archive”[5].

[1] [http://netpreserve.org/projects/live-archiving-http-
proxy/](http://netpreserve.org/projects/live-archiving-http-proxy/)

[2]
[http://squidman.net/squidman/index.html](http://squidman.net/squidman/index.html)

[3]
[http://www.gedanken.org.uk/software/wwwoffle/](http://www.gedanken.org.uk/software/wwwoffle/)

[4] [https://webrecorder.io/](https://webrecorder.io/)

[5] [https://pinboard.in/upgrade/](https://pinboard.in/upgrade/)

~~~
agamble
Great points.

You're right, for now it's a single rate-limited HTML form and you'll have to
manually collate the links to the archives you create. I'll be adding
specialty features (with accounts) next. :)

------
rahiel
An internet archive can only provide value if it's there for the long-term.
What's your plan to keep this service running if it gets popular? For example,
archive.is cost about $2,000/month at the start of 2014 [1]. I expect it
costs even more now.

[1]: [http://blog.archive.is/post/72136308644/how-much-does-it-
cos...](http://blog.archive.is/post/72136308644/how-much-does-it-cost-you-to-
host-a-website-of)

------
venning
Thoughts:

I like the look. Very clean. I like how fast it's responding; better than
archive.org (though, obviously, they have different scaling problems).

"Your own internet archive" might be overselling it, as other commenters have
pointed out; the "Your" feels a bit misleading. I think "Save a copy of any
webpage." gives a better impression, which you use on the site itself.

The "Archive!" link probably shouldn't work if there's nothing in the URL box.
It just gives me an archive link that errors. Example: [1]

Using it on news.YC as a test gave me errors with the CSS & JS [2]. This might
be because HN appends query parameters to its CSS and JS URLs, which are
repeated in the Tesoro URL and may not be parsed correctly.

Maybe have something in addition to an email link for submitting error reports
like the above, just because I'd be more likely to file a GitHub issue (even if
the repo is empty) than send a stranger an email.

As other commenters have pointed out, archive.is also does this, and their
longevity helps me feel confident that they'll still be around. Perhaps, if
you wish to differentiate, offer some way for me to "own" the copy of the
page, like downloading it or emailing it to myself or sharing it with another
site (like Google Docs or Imgur) to leverage redundancy, or something like
that. Just a thought.

All in all, nice Show HN.

EDIT: You also may want to adjust the header to work properly on mobile
devices. Still though, nice job. Sorry if I'm sounding critical.

[1]
[https://archive.tesoro.io/320b55cc9b78e271c94716ee23554da8](https://archive.tesoro.io/320b55cc9b78e271c94716ee23554da8)

[2]
[https://archive.tesoro.io/a7bf03e247224bc3b4e5a7c1f2ad42b1](https://archive.tesoro.io/a7bf03e247224bc3b4e5a7c1f2ad42b1)

~~~
agamble
Thanks! These are great comments - I'll look into the issue with saving Hacker
News CSS + JS.

------
bfirsh
What's the best way to automatically archive all of the data I produce on
websites? Facebook, Twitter, Instagram, blogs, and so on. At some point these
services will disappear, and I want to preserve them.

I know a lot of these sites have archiving features, but I want something
centralised and automatic.

~~~
aethertron
The hypothetical system that makes the most sense to me for this: a process
that runs 24/7 on a server, watching your feeds on those services and grabbing
and saving everything via APIs or screen-scraping.
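
A bare-bones sketch of that kind of watcher in Go; the feed URL and polling
interval are placeholders, and a real version would need an authenticated API
client or scraper per service:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Placeholder: in practice this would be one endpoint or scraper per
	// service (Facebook, Twitter, Instagram, ...).
	const feedURL = "https://example.com/me/feed.json"

	for {
		resp, err := http.Get(feedURL)
		if err == nil {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			name := fmt.Sprintf("snapshot-%d.json", time.Now().Unix())
			_ = os.WriteFile(name, body, 0o644) // append-only local archive
		}
		time.Sleep(15 * time.Minute) // poll interval
	}
}
```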

~~~
fiatjaf
Is that creepy, resource-eating, bug-prone service what makes the most sense
to you?

~~~
aethertron
Yes. Why do you call it 'creepy'? It's supposed to be a personal service,
owned and controlled by the user who wants to archive stuff.

And all computations consume resources, and may have bugs. So what? They can
be optimised. Bugs can get fixed. The process would, ideally, auto-update.

------
akerro
Nice, post it on
[https://www.reddit.com/r/DataHoarder/](https://www.reddit.com/r/DataHoarder/)

They will love it!

------
zippoxer
Cool tool, but by using it, you depend on it staying alive for longer than any
page you archive on it.

This got me thinking about how a decentralized p2p internet archive could
solve the trust problem that exists in centralized internet archives. Such a
solution could also increase the capacity of archived pages and the frequency
at which archived pages are updated.

It is true that keeping the entire history of the internet on your local drive
is likely impossible, but a solution similar to what Sia is doing could solve
this problem: split each page into 20 pieces and distribute them across peers
such that any 10 pieces can recover the original page. So you only have to
trust that 10 of the 20 peers that store a page are still alive to get the
complete page.
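
As a rough illustration of that split-and-recover idea (not Sia's actual
scheme), here is a Go sketch assuming the third-party klauspost/reedsolomon
erasure-coding library: 10 data shards plus 10 parity shards, so any 10 of the
20 pieces rebuild the page:

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/reedsolomon" // assumed third-party erasure-coding library
)

func main() {
	page := []byte("<html>an archived page</html>")

	// 10 data shards + 10 parity shards: any 10 of the 20 pieces rebuild the page.
	enc, err := reedsolomon.New(10, 10)
	if err != nil {
		panic(err)
	}

	shards, err := enc.Split(page)
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil {
		panic(err)
	}

	// Simulate half of the peers disappearing.
	for i := 0; i < len(shards); i += 2 {
		shards[i] = nil
	}

	// The surviving 10 shards are enough to reconstruct the original.
	if err := enc.Reconstruct(shards); err != nil {
		panic(err)
	}
	var out bytes.Buffer
	if err := enc.Join(&out, shards, len(page)); err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(out.Bytes(), page)) // true
}
```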

The main problem I can see right now would be lack of motivation to contribute
to the system -- why would people run nodes? Just because it would feature
yet another cryptocurrency? Sure, this could hold now, but when the
cryptocurrency craze quiets down and people stop buying random
cryptocurrencies just for the sake of trading them, what then? Who would run
the nodes and why?

~~~
burkemw3
IPFS [0] and its sibling Filecoin [1] are working in a very similar space to
your wonderings.

[0]: [https://ipfs.io/](https://ipfs.io/) [1]:
[https://filecoin.io/](https://filecoin.io/)

------
j_s
The discussion 3 months ago on bookmarks mentioned several options for
archiving pages (some locally): _Ask HN: Do you still use browser bookmarks?_
|
[https://news.ycombinator.com/item?id=14064096](https://news.ycombinator.com/item?id=14064096)

extensions: Firefox "Print Edit" Addon / Firefox Scrapbook X / Chrome Falcon /
Firefox Recoll

open source: Zotero / WorldBrain / Wallabag

commercial: Pinboard / InstaPaper / Pocket / Evernote / Mochimarks / Diigo /
PageDash / URL Manager Pro / Save to Google / OneNote / Stash / Fetching

public: [http://web.archive.org](http://web.archive.org) /
[https://archive.is/](https://archive.is/)

------
idlewords
You're going to get this service shut down if you let anonymous people
republish arbitrary content while running everything on Google.

I (obviously) think personal archives are a great idea, but republishing is a
hornets' nest.

------
Retr0spectrum
Is this any different to archive.is?

If I want my _own_ archive, Ctrl+S in Firefox usually works fine for me.

------
crispytx
You know, your site actually does a better job reproducing webpages than
archive.org. I've noticed that if you use a CDN to serve up CSS & JS for a
webpage that you're trying to archive on archive.org, it won't render
correctly. On your site, there doesn't seem to be a problem including CSS & JS
from an external domain. Thumbs up :)

~~~
agamble
OP here. Thanks! Could you point me to the pages where it worked well for you
vs archive.org?

------
zichy
So this is like archive.is, but I can't search through archived sites?

------
CM30
When you said 'own internet archive' I thought you meant some sort of program
you could download that'd save your browsing history (or whatever full website
you wanted) to your hard drive. I think that would have been significantly
more useful here.

As it is, while it's a nice service, it's still got all the issues of other
archive services:

1. It's online only, so one failed domain renewal or hosting payment takes
everything offline.

2. It being online also means I can't access any saved pages if my connection
goes down or has issues.

3. The whole thing is wide open to having content taken down by websites
wanting to cover their tracks. I mean, what do you do if someone tells you to
remove a page? What about with a DMCA notice?

It's a nice alternative to archive.is, but still doesn't really do what the
title suggests if you ask me.

------
jpalomaki
This might be a good use case for distributed storage (IPFS?).

Instead of hosting this directly on my computer, it would be interesting to
have a setup where the archiving is done via the service and I would just
provide storage space somewhere that the content would end up being mirrored
to (just to guarantee that my valuable things are saved at least somewhere,
should the other nodes decide to remove the content).

I would prefer this setup, because it would be easily accessible for me from
any device and I would not need to worry about running some always available
system. With some suitable P2P setup my storage node would have less strict
uptime requirements.

~~~
johnaberlin
Hi jpalomaki,

Have you heard of InterPlanetary Wayback (ipwb)?
[https://github.com/oduwsdl/ipwb](https://github.com/oduwsdl/ipwb)

InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web
archives by disseminating the contents of WARC files into the IPFS network.

------
dbz
This is pretty cool. I have a Chrome extension that lets you view the cached
version of a web page [1]. Would I be able to use this through an API? I
currently support Google Cache, WayBack Machine, and CoralCDN, but Coral
doesn't work well and I'd like to replace it with something else.

[1]
[https://chrome.google.com/webstore/detail/cmmlgikpahieigpccl...](https://chrome.google.com/webstore/detail/cmmlgikpahieigpcclckfmhnchdlfnjd)

~~~
agamble
OP here.

Yup, API and chrome extension are next on the feature list. :)

------
prirun
I think you should explain why you're paying Google to archive web pages for
others, i.e., how do you plan on benefiting from this? If you have some business
model in mind, let people know now. It's the first question that comes to my
mind when someone offers a service that is free yet costs the provider real
money. You obviously can't pay Google to archive everyone's web pages just for
the fun of it.

~~~
agamble
OP here.

Great point. Right now this is just a single rate-limited HTML form to gauge
interest. Next is to build specialty features that are worth paying for and
make this sustainable. :)

------
gorbachev
You should try to rewrite relative links in websites that get archived. I
tested your app with a news site, and all the links go to
archive.tesoro.io/sites/internal/url/structure/article.html

I also second the need for user accounts. If I am to use your site as my
personal archive, then I would need to log in and create a collection of my
own archived sites.

------
arkenflame
I made a simple Chrome extension to automatically save local copies of pages
you bookmark, if you prefer that instead:
[https://chrome.google.com/webstore/detail/backmark-back-
up-t...](https://chrome.google.com/webstore/detail/backmark-back-up-the-
page/cmbflafdbcidlkkdhbmechbcpmnbcfjf)

------
lozzo
It would be nice to have a bit of explanation of how it works and why we can
be confident that we can rely upon it.

~~~
agamble
OP here. Definitely, great idea :)

Briefly: Sites are archived using a system written in Golang and uploaded to a
Google Cloud bucket.

More: The system downloads the remote HTML, parses it to extract the relevant
dependencies (<script>, <link>, <img> etc.) and then downloads these as well.
Tesoro even parses CSS files to extract the url('...') file dependencies,
meaning most background images and fonts should continue to work. All
dependencies (even those hosted at remote domains) are downloaded and hosted
with the archive, and the src attributes on the original page tags are
rewritten to point to the new location.
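
As a rough sketch of that dependency-extraction step (not the actual Tesoro
code), here is how it might look in Go using the golang.org/x/net/html parser;
the CSS url(...) regex is illustrative only:

```go
package main

import (
	"fmt"
	"net/http"
	"regexp"

	"golang.org/x/net/html"
)

// cssURL matches url('...') / url("...") references inside downloaded stylesheets.
var cssURL = regexp.MustCompile(`url\(['"]?([^'")]+)['"]?\)`)

// extractDeps walks the parsed HTML and collects the asset references an
// archiver would need to download and rewrite: <script src>, <img src>, <link href>.
func extractDeps(n *html.Node, deps *[]string) {
	if n.Type == html.ElementNode {
		attr := ""
		switch n.Data {
		case "script", "img":
			attr = "src"
		case "link":
			attr = "href"
		}
		if attr != "" {
			for _, a := range n.Attr {
				if a.Key == attr && a.Val != "" {
					*deps = append(*deps, a.Val)
				}
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		extractDeps(c, deps)
	}
}

func main() {
	resp, err := http.Get("https://example.com/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	var deps []string
	extractDeps(doc, &deps)
	// Each of these would be downloaded, stored with the archive, and its
	// reference rewritten; downloaded CSS would get the same treatment via
	// cssURL.FindAllStringSubmatch.
	fmt.Println(deps)
}
```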

The whole thing is hosted on GCP Container Engine and I deploy with
Kubernetes.

I'll write up a more comprehensive blog post in some time, which portion of
this would you like to hear more about?

~~~
19eightyfour
The issue is cost. Your costs are disk space for people's archives, instances
for people's use, and bandwidth for the fetches and crawls and access.

How can you pay for this if it's free? It's unreliable unless it's financially
viable.

~~~
agamble
Totally right, great observation :)

For now it's a free service with a single rate-limited form. Now it's time to
work on adding specialty features that are worth paying for.

------
jdc0589
> Tesoro saves linked assets, such as images, Javascript and CSS files.

I'm confused. It looks like image sources in "archived" pages on Tesoro still
point back to the origin domain.

Edit: it works as expected. I just didn't notice the relative paths.

~~~
agamble
OP here.

The site will rewrite absolute image URLs as relative ones pointing to Tesoro.
For example, in the Chicken Teriyaki example on the homepage, the main image is
sourced from the relative location "static01.nyt.com/.../28COOKING-CHICKEN-
TERIYAKI1-articleLarge.jpg", which looks like it's coming from nytimes.com,
but you can check in the Chrome dev console that it isn't.
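
For illustration, a tiny Go sketch of the kind of absolute-to-relative
rewriting described above (not Tesoro's actual code; the asset URL is made up):

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// localPath shows one way an archiver could map an absolute asset URL to a
// path served from the archive itself. This mirrors the idea described above,
// not Tesoro's actual scheme.
func localPath(absolute string) (string, error) {
	u, err := url.Parse(absolute)
	if err != nil {
		return "", err
	}
	// Keep host + path so assets from different domains can't collide.
	return path.Join(u.Host, u.Path), nil
}

func main() {
	p, err := localPath("https://cdn.example.com/assets/hero.jpg")
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // cdn.example.com/assets/hero.jpg
	// An <img src> pointing at the original URL would be rewritten to this
	// relative path in the archived copy.
}
```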

Have you found an example where it isn't working correctly? If so would you
mind posting it here and I'll fix it :).

~~~
ikreymer
Unfortunately, this approach alone will only work for sites that are mostly
static, i.e. do not use JS to load dynamic content. That is a small (and
shrinking) percentage of the web. Once JS is involved, all bets are off -- JS
will attempt to load content via ajax, generate new HTML, load iframes, etc.,
and you will have 'live leaks' where the content seems to be coming from the
archive but is actually coming from the live web.

Here is an example from archiving nytimes home page:

[https://archive.tesoro.io/665dbeab57a4d57d8140f89cfedc69b5](https://archive.tesoro.io/665dbeab57a4d57d8140f89cfedc69b5)

If you look at the network traffic (the domain column in devtools), you'll see
that only a small % is coming from archive.tesoro.io -- the rest of the content
is loaded from the live web. This can be misleading and possibly a security
risk as well.

Not to discourage you, but this is a hard problem that I've been working on
for years now. This area is a moving target, but we think live leaks are
mostly eliminated in Webrecorder and pywb, although there are lots of areas to
work on to maintain high-fidelity preservation.

If you want to chat about possible solutions or want to collaborate (we're
always looking for contributors!), feel free to reach out to us at support
[at] webrecorder.io or find my contact on GH.

------
salmonfamine
Worth noting that Tesoro is the name of a major oil/fuel company in Texas.

------
NicoJuicy
When a company went down, I downloaded every one of their clients' sites with
HTTrack and wget, just to be sure their clients wouldn't lose their sites (and
some other things).

I wonder what this site uses.

------
pbhjpbhj
How are you handling copyright infringement? Outside the USA's Fair Use terms
this looks like pretty blatant infringement.

~~~
iso-8859-1
What is there to handle? You take down stuff when you get an email? Most users
of this will be so small that they'll never get noticed. Maybe they won't even
be online; how are you going to know you were infringed? Maybe the crawler
allows for spoofing the user-agent.

~~~
pbhjpbhj
So, ignoring it basically.

If a person in the UK uses your service you're committing contributory
infringement for commercial purposes, AFAICT.

Moreover, the ECD has different protections than the DMCA. In particular, a
takedown notice isn't required.

>Maybe the crawler allows for spoofing the user-agent. //

As a tort you only need a preponderance of evidence. The IP of the crawler
that made the copy puts the owner of that IP in court for contributory
infringement, no?

If you make copies of parts of BBC sites and serve those copies from your
server how is that not copyright infringement by you??

 _FWIW I like the service and do not like the copyright regime as it stands,
particularly how UK law lacks the breadth of liberties of Fair Use._

------
skdotdan
Nice. How are you planning to pay for the servers? Your service seems quite
storage-intensive.

~~~
WhiteOwlLion
I got a dedicated server in France that cost me less than $20 USD/month. 16GB
RAM, 1TB storage: [https://www.online.net/en/dedicated-server/dedibox-
xc](https://www.online.net/en/dedicated-server/dedibox-xc)

~~~
mkroman
With no redundancy, no backup and no way to extend storage. I'm not sure how
you'd archive the internet with low-range dedicated server deals.

