
Show HN: PageDash, Your Personal Web Archive - ernsheong
https://www.pagedash.com/
======
vitovito
Please consider producing archives in WARC format, and either donating
captures of public pages to the Internet Archive (and other interested
archives), or supporting ways for users to download their own archives in that
format for them to donate them themselves and use in systems like Webrecorder.

(Note that a download of just page content and assets isn't enough; WARC
stores headers, etc., also.)
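
To make the "headers, etc." point concrete, here is a minimal sketch of what one WARC `response` record looks like when serialized by hand. The function name `build_warc_response` is hypothetical, and the field set is only the handful of named fields from the WARC/1.0 layout as I understand it; a real archiver would more likely use an existing WARC library and also write `request` and `warcinfo` records.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response(target_uri: str, http_response: bytes) -> bytes:
    """Serialize one WARC/1.0 'response' record.

    `http_response` is the raw HTTP response as received: status line,
    HTTP headers, blank line, body. That is why saving only the page
    content and assets loses information a WARC would keep.
    """
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: {record_id}",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # WARC headers end with a blank line; the record ends with two CRLFs.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return head + http_response + b"\r\n\r\n"
```

Passing the raw bytes of an HTTP response (headers included) plus the page URL returns a single record that could be appended to a `.warc` file.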

~~~
ernsheong
Thanks for the comment. Admittedly I bypassed WARC completely as I felt
overwhelmed by its technicalities in favor of how I knew the web worked. If I
have a better understanding of WARC maybe that can be done, but I make no
promises.

~~~
bravura
I'm not sure why this is getting downvoted.

WARC tooling is pretty terrible. I hack data and text for a living and working
with WARC was extremely difficult compared to the simplicity of what I was
trying to do. Something that would have taken me five minutes (extract body
text) took me about 60 - 90 minutes because the first few libraries I tried
didn't work easily.

~~~
gooseus
I don't think it's getting downvoted because WARC is actually easy, I think
it's because the creator says they can't be bothered to figure out the web
archiving standard before asking people to start giving them money for their
web archiving product.

The fact that WARC tooling is terrible is the opportunity this person should
be leveraging to create value, imo.

~~~
j_s
I will reference this thread when I need an example of minimum viable product
catching flak. Putting off WARC makes financial sense.

~~~
gooseus
The response wasn't:

"I was planning to support WARC, but felt that it was far too complicated to
put off the launch of this MVP. If it's an essential feature to make this
product valuable then I will definitely make it a priority."

It was "I felt overwhelmed...", "...maybe..." and "...no promises".

I appreciate the honesty, but it doesn't inspire confidence and I think the
flak is more a reaction to the response regarding WARC than the lack of WARC
itself.

~~~
jasonlotito
I feel both your suggested response and the given response amount to the same
thing.

I mean, looking at what you said and what they said side by side, and how I
read it:

> Admittedly I bypassed WARC completely as I felt overwhelmed by its
> technicalities in favor of how I knew the web worked.

> I was planning to support WARC, but felt that it was far too complicated to
> put off the launch of this MVP.

Both mention looking at using WARC, and both mention it being complicated.
Both say it's not yet available.

> If I have a better understanding of WARC maybe that can be done, but I make
> no promises.

> If it's an essential feature to make this product valuable then I will
> definitely make it a priority.

Both offer up the potential for supporting WARC. Neither promises its delivery.
The former provides a bit more justification as to the condition for its
inclusion (technical capabilities), while the latter hides this fact.

> I appreciate the honesty,

Honesty was the only difference between the response you hoped for and the
response provided.

~~~
gooseus
Really??

The tone and focus of the two statements are entirely different and
communicate completely different impressions of the creator's relationship to
the question/criticism.

I mean I don't really want to get pedantic here, but... I will anyways, weird
mood I suppose:

My statement implies an understanding of the value of WARC and an intention to
implement the feature, but tempered by a desire to understand the needs of the
customer before taking on the task.

Their response indicates a lack of understanding of the technology (some of
the honesty that I appreciate, fyi) and offers a non-commitment to
implementing it, even IF they were able to understand it.

I'll admit my impression is also colored by their other responses, not simply
this one in isolation.

Taking them together, the impression is that while they're willing to take
money and start allowing you to rely on them for archived data... they aren't
going to guarantee they'll be around unless they make enough money to make it
worth it for them to even bother.

Again, honesty I really do appreciate, though I can't agree with the approach.

You may say that's what everyone would do, but I disagree. I think great
products come from a focus on customer value and a willingness to do the hard
work to provide that value. I believe my statement communicates that far more
than theirs.

~~~
ollerac
You guys are fucking awful.

~~~
gooseus
Yep, can't disagree with that right now... still just trying to help. Cheers!

------
ernsheong
Founder here, happy to field questions and feedback!

Right now PageDash is quite a simple product, but hopefully with sufficient
traction we can continue to implement things like full-text search, tagging
support, link sharing, as well as mobile support. Your support is absolutely
crucial to making PageDash come alive even more in the future.

This is my first product, thank you for being nice :)

~~~
CJKinni
Would it be possible to add auto-archiving to the extension?

On the $9/mo plan, I'd probably still not hit 100GB/mo of uploads.

My main reason for wanting this feature is that I could use the full text
search (when it's available) to search every webpage I've visited. I find
myself more and more frequently unable to find things I know existed at one
point. I've been thinking of building my own solution where I just archive
every page I visit on the fly then build a personal search for pages I've
previously visited.

~~~
ernsheong
Hi there, I actually know someone else who wants this feature.

I feel like most of the sites out there are "junk", i.e. things that are not
worth saving. So just save the gems. But that's me.

My feelings aside, technically this would be rather hard to do as it stands
because saving a page is quite costly: both 1) data retrieval and 2) data
upload stall things, so you would not be able to use the internet seamlessly.
For some reason, even though the assets are already in the browser, PageDash
as an extension cannot retrieve them directly. Also, the extension right now
over-mines, i.e. retrieves way more than what is actually needed to render the
page, just because there are a lot of dead CSS/JS/image links out there, and I
haven't figured out a way to make it more efficient.

However, if all you care about is the HTML (which is essentially what full-
text searches search), then auto-save is technically and realistically
possible, but the result won't be pretty.
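
A rough sketch of what HTML-only saving buys you for search, using just the Python standard library. This is not how PageDash works; `extract_text` and `build_index` are hypothetical names, and the "index" is a plain in-memory word-to-URL map rather than a real search engine.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    return " ".join(" ".join(parser.parts).split())


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in re.findall(r"\w+", extract_text(html).lower()):
            index[word].add(url)
    return index
```

Looking up `index["word"]` then gives every saved URL whose text contains that word, with no CSS, JS, or images stored at all.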

~~~
ernsheong
Chrome does support keyboard shortcuts, though. Go to chrome://extensions/,
look for "Keyboard shortcuts" at the bottom right, and configure away for a
quick save :)

------
teddyh
Why must everything be a cloud service? I use ScrapBook
([http://www.xuldev.org/scrapbook/](http://www.xuldev.org/scrapbook/)) to save
web pages locally.

~~~
wongarsu
Because without a cloud service you can't get people to pay $3-$9/month for
something like this.

------
CabSauce
I really like the idea of this and other, similar products/services. I haven't
used any of them since they don't seem to be exactly what I want.

What I really want is auto tagging and classification + semantic search. I
don't even really want to have to save the page. I want this functionality on
my browsing history.

Maybe some increased functionality for saving specific types of pages. If I
save a recipe, I want the service to recognize that it's a recipe and put it
in my 'cookbook'. With a consistent format, if possible. If I save a blog
post, tag the topic, technology and language used.

~~~
ernsheong
I really want auto-tagging via ML classification as well; it's one of the
things I wanted, other than one-click save, when I started the project. That's
a nice-to-have at the moment and can only be achieved once PageDash matures
more. Right now the closest thing I can offer for your use is to configure a
keyboard shortcut for saving with the extension via
chrome://extensions > Keyboard shortcuts (bottom) for quick saving.

~~~
CabSauce
Yeah, I realized after I posted that this is a good first step to that goal; a
way to gather a bunch of user search and tagging data. Thanks for sharing.

------
rahiel
The previous web archive launched on HN is already dead [0]. Many of the
comments from that discussion also apply here. Good luck and I hope you'll
manage to stay online!

[0]:
[https://news.ycombinator.com/item?id=14644441](https://news.ycombinator.com/item?id=14644441)

~~~
j_s
Open-source, self-hosted alternative(s) discussed within days of the above:

Wallabag: a self-hostable application for saving web pages |
[https://news.ycombinator.com/item?id=14686882](https://news.ycombinator.com/item?id=14686882)
(2017Jul:166points,53comments)

------
Accacin
What are the advantages of this over something like pinboard.in?

~~~
ernsheong
Unless you are on Pinboard's archiving plan, Pinboard mostly manages just your
bookmarks. PageDash doesn't claim to be a bookmark manager, but it really can
be one. Bookmark the page, along with the content.

------
adityar
Signed up - saved my first page - and viewed my dash within 5 mins. Good
stuff. Now, all you need is not to go out of business (or open source before
you do). Seriously though, good luck on the business side.

~~~
ernsheong
Thank you. You're right, hopefully the business side holds up. I'll keep it up
as long as someone is paying me :)

------
ernsheong
I should point out that PageDash also tries to handle saving nested pages and
iframes; I'm not sure that's something other archivers try to do.

Also, Web Components (custom elements, shadow DOM) support is definitely
doable and something for the pipeline. It's not something even the Internet
Archive is capable of right now. Wayback Machine's youtube.com archive is
blank.

------
michaelmior
Looks interesting. Why would I use PageDash over something like Evernote or
Pocket?

~~~
ernsheong
Good question. PageDash aims to preserve the page in the original format and
render it just as you saw it. Right now, Evernote does quite a bad job at
rendering (I've used it a lot). Pocket, on the other hand, specializes in
stripping out the HTML and leaving just the content in a reader-mode fashion,
though I've not tried their premium offering that also archives.

PageDash archives from the front-end, while many archivers tend to archive by
sending a link to the backend which then queries the website remotely, so you
might not be archiving what you saw exactly, which admittedly in many cases
doesn't matter. The upside of this technicality is that you can save content
that you see only when you are logged in!

~~~
michaelmior
Sounds like PageDash is not really for me then, but I can definitely see why
some others might want to use it. Best of luck!

~~~
ernsheong
Hi Michael, what is your use case? I do plan to include a reader mode in
PageDash to view pages in a clean layout. It is possible because PageDash has
all the raw data available for each page.

------
gkya
Anything FOSS in this sphere? I'm slowly going towards building my solution to
automatically archive my Firefox bookmarks locally, but a bit too slowly.

~~~
ernsheong
There's plenty.

Here's a pretty comprehensive list that someone else made:
[https://news.ycombinator.com/item?id=14647119](https://news.ycombinator.com/item?id=14647119)

Here's another FOSS project that I found:
[https://github.com/pirate/bookmark-archiver](https://github.com/pirate/bookmark-archiver)

------
nels
Have you considered saving files (such as fonts and JS libs) loaded through
major CDNs centrally just once instead of storing it again each time a page is
saved?

Maybe you already have plans for this, but it would be smart to implement a
system that checks whether files are already present on your server so you
don't waste any of your user's quota and the server's disk space.

~~~
ernsheong
Thanks for the comment! That would be ideal and it has crossed my mind, but I
have given little thought to how to do de-duplication right (premature
optimization from a maker's perspective). Right now each page and its assets
sit within its own "bucket". But yes, page assets and all these dependencies
can really add up fast.
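
One common way to get the "store it once" behaviour is content-addressed storage: key each asset by a hash of its bytes, and let each page keep only references. The sketch below is an assumption about how that could look, not PageDash's design; `AssetStore` and its methods are hypothetical, and the dictionaries stand in for a real blob store and database.

```python
import hashlib


class AssetStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        # digest -> asset bytes (shared across all pages)
        self.blobs: dict[str, bytes] = {}
        # page id -> {asset URL: digest}; plays the role of the per-page "bucket"
        self.pages: dict[str, dict[str, str]] = {}

    def save_asset(self, page_id: str, url: str, content: bytes) -> bool:
        """Record the asset for the page; return True if new bytes were stored."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        self.pages.setdefault(page_id, {})[url] = digest
        return is_new

    def load_asset(self, page_id: str, url: str) -> bytes:
        return self.blobs[self.pages[page_id][url]]
```

Hashing the content rather than the URL means the same font served from two different CDNs still collapses to one blob, while a changed file at the same URL is stored as a new one, so each saved page keeps the version it was captured with.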

------
abainbridge
Excellent work. I can now close all those browser tabs I've had open in the
background for weeks, just so I don't lose the page.

~~~
vpvp
Wouldn't the OneTab extension be a better solution? I see PageDash as a
personal Internet Archive/Wayback Machine.

~~~
abainbridge
Possibly yes for the problem as I described it. But really what I want is to
mark pages that I thought were interesting and _might_ want to see again at
some point in my life. Maybe that will be in two years. My approach of leaving
a load of tabs open for a few weeks catches many of the cases of pages I want
to re-read, but is obviously not feasible for years.

And really I want to be able to search my history with queries like (contains
foo, not bar, was marked sometime in 2014 or 2015). And see my history in
chronological order. The problem with my browser's history is that it is
polluted with the pages I never want to see again, like last week's weather
forecast.

I'm sure that there's something that does this already, maybe it is Evernote.
But PageDash's guarantee of preserving the historical state seems novel and
worthwhile.
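
The kind of query described above (contains foo, not bar, marked in 2014 or 2015, in chronological order) can be sketched as a simple filter over saved pages. `SavedPage` and `search` are hypothetical names, and plain substring matching stands in for a real full-text index.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SavedPage:
    url: str
    text: str
    saved_on: date


def search(pages, contains, excludes=(), years=()):
    """Return matching pages, oldest first.

    contains: substring that must appear (case-insensitive)
    excludes: substrings that must not appear
    years:    if non-empty, the page must have been saved in one of them
    """
    hits = [
        p for p in pages
        if contains.lower() in p.text.lower()
        and not any(x.lower() in p.text.lower() for x in excludes)
        and (not years or p.saved_on.year in years)
    ]
    return sorted(hits, key=lambda p: p.saved_on)
```

For the example query, the call would be `search(pages, "foo", excludes=["bar"], years=[2014, 2015])`.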

~~~
aeorgnoieang
I'm sure Evernote could do this.

I've been using two different kinds of apps for this:

- Offline-reading apps for stuff I want to read
- (Tech) support ticket apps for stuff I (might) want to do

I'm currently using Pocket for the first kind. Before that I was using
Instapaper.

For the second kind I'm using GitLab but I've still got a lot of old content
in FogBugz that's not yet migrated. Neither saves or indexes the contents of
links tho so I manually include text about what it is I wanted to do or
sometimes just specific text I'd expect future-me to use to find the stuff.

The main reason I use GitLab for stuff related to something I might want to do
is that I use it for all of my other 'project' content too so it's nice to
have everything of that nature in one place.

~~~
abainbridge
I just tried Evernote. I had a bunch of tabs open in Chrome. I installed both
PageDash and Evernote extensions, then clicked their respective toolbar
buttons. Here's what ensued.

PageDash: The page was saved. I went to app.pagedash.com and could see it at
the top of my history.

Evernote: The extension popped up a dialog asking whether I wanted to clip an
Article, Small Article, Bookmark. I didn't know the difference so just clicked
Save. I got a dialog telling me I needed to reload the page. So I did and
clicked the button again. That worked but took much longer than PageDash, then
showed another dialog saying "No Result" in large print and "Clipped to First
Notebook" in small print. Ugghh. I visited the Evernote website and got a
massive popup asking me if my email was still the one that I'd registered four
minutes earlier. I clicked through and was redirected to a massive long URL. I
could see my clipped article there, but it's unclear if this massive URL is the
one I should bookmark.

It looks like PageDash is simpler and faster than Evernote. I like simple and
fast.

Unfortunately it looks like I'll run out of my monthly storage allowance on
PageDash very quickly. And I can't see any search function in PageDash.

~~~
ernsheong
Thanks for the feedback. Search is high on my priority list next (already
started looking into it)! And I really meant the free quota to be more of a
try-PageDash thing, so I'd really appreciate the support via a paid
subscription, because ultimately that would keep me going. Please do let me
know via support@pagedash if you feel the pricing is a bit steep, I can open
up an even lighter plan. But I know that not being able to search is a deal-
breaker, so stay tuned for that.

------
pwenzel
After signing in, my initial reaction is that I wish I didn't have to use a
browser extension to save a page.

It would be handy if I could just enter a URL and have it saved, a la Pinboard
or Instapaper.

That said, this worked very well on my first try.

~~~
ernsheong
Thanks for the comment! Maybe I will make that possible in the future, but for
now the advantage of this is that you can save logged-in content, i.e. content
that you see when you're logged in. Passing the URL to the backend prevents
that, as the backend is not authenticated or, even worse, is blocked.

------
ernsheong
Alright folks, it's 3am where I am at the moment, gonna hit the sack. I'll
address more questions and concerns tomorrow. Thank you for all your feedback!

------
ff7c11
So this works until your cloud site dies. No thanks.

~~~
ernsheong
There are a few ways I can go about mitigating this.

1) Provide PageDash with API access to your s3/gcp bucket so that it syncs
your pages out to your bucket.

2) Provide an open-source viewer to view files saved within your bucket.
It's just like serving a website, really, no more processing needed.

------
tmlee
This will be great for archiving the past :)

------
Maarius
Which media files are stored? I assume images yes, videos no?

~~~
ernsheong
Images yes, but videos not yet.

------
bobbyongce
Very interesting idea

------
tevanraj
Awesome stuff!

