Show HN: PageDash, Your Personal Web Archive (pagedash.com)
116 points by ernsheong on Nov 8, 2017 | 66 comments



Please consider producing archives in WARC format, and either donating captures of public pages to the Internet Archive (and other interested archives), or letting users download their own archives in that format so they can donate them themselves and use them in systems like Webrecorder.

(Note that a download of just page content and assets isn't enough; WARC stores headers, etc., also.)
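
To make the point about headers concrete, here is a rough sketch of what a single WARC/1.0 "response" record looks like on disk: a block of WARC headers, a blank line, then the raw HTTP response (status line, HTTP headers, and body) kept verbatim. The function name `build_warc_response_record` is made up for illustration, and this hand-rolled writer skips things a real tool would add (request records, digests, gzip), so treat it as a picture of the layout rather than a replacement for a proper WARC library.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap a raw HTTP response (status line + headers + body) in one WARC/1.0 record.

    Hypothetical helper for illustration only; not a complete WARC writer.
    """
    warc_headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Content-Length counts the whole HTTP message, headers included.
        ("Content-Length", str(len(http_response))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in warc_headers)
    # A blank line separates the WARC headers from the block,
    # and two CRLFs terminate the record.
    return head.encode("utf-8") + b"\r\n" + http_response + b"\r\n\r\n"
```

Saving only the HTML and assets would drop the status line and HTTP headers that sit inside that block, which is the part replay tools rely on.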


Thanks for the comment. Admittedly I bypassed WARC completely as I felt overwhelmed by its technicalities in favor of how I knew the web worked. If I have a better understanding of WARC maybe that can be done, but I make no promises.


I'm not sure why this is getting downvoted.

WARC tooling is pretty terrible. I hack data and text for a living and working with WARC was extremely difficult compared to the simplicity of what I was trying to do. Something that would have taken me five minutes (extract body text) took me about 60 - 90 minutes because the first few libraries I tried didn't work easily.


I don't think it's getting downvoted because WARC is actually easy, I think it's because the creator says they can't be bothered to figure out the web archiving standard before asking people to start giving them money for their web archiving product.

The fact that WARC tooling is terrible is the opportunity this person should be leveraging to create value, imo.


I will reference this thread when I need an example of minimum viable product catching flak. Putting off WARC makes financial sense.


The response wasn't:

"I was planning to support WARC, but felt that it was far too complicated to put off the launch of this MVP. If it's an essential feature to make this product valuable then I will definitely make it a priority."

It was "I felt overwhelmed...", "...maybe..." and "...no promises".

I appreciate the honesty, but it doesn't inspire confidence and I think the flak is more a reaction to the response regarding WARC than the lack of WARC itself.


I feel both your suggested response and the given response amount to the same thing.

I mean, looking at what you said and what they said side by side, and how I read it:

> Admittedly I bypassed WARC completely as I felt overwhelmed by its technicalities in favor of how I knew the web worked.

> I was planning to support WARC, but felt that it was far too complicated to put off the launch of this MVP.

Both mention looking at using WARC, and both mention it being complicated. Both say it's not yet available.

> If I have a better understanding of WARC maybe that can be done, but I make no promises.

> If it's an essential feature to make this product valuable then I will definitely make it a priority.

Both offer up the potential for supporting WARC. Neither promises its delivery. The former provides a bit more justification as to the condition for its inclusion (technical capabilities), while the latter hides this fact.

> I appreciate the honesty,

Honesty was the only difference between the response you hoped for and the response provided.


Really??

The tone and focus of the two statements are entirely different and communicate completely different impressions of the creator's relationship to the question/criticism.

I mean I don't really want to get pedantic here, but... I will anyways, weird mood I suppose:

My statement implies an understanding of the value of WARC and an intention to implement the feature, but tempered by a desire to understand the needs of the customer before taking on the task.

Their response indicates a lack of understanding of the technology (some of the honesty that I appreciate, fyi) and offers a non-commitment to implementing it, even IF they were able to understand it.

I'll admit my impression is also colored by their other responses, not simply this one in isolation.

Taking them together, the impression is that while they're willing to take money and start allowing you to rely on them for archived data... they aren't going to guarantee they'll be around unless they make enough money to make it worth it for them to even bother.

Again, honesty I really do appreciate, though I can't agree with the approach.

You may say that's what everyone would do, but I disagree. I think great products come from a focus on customer value and a willingness to do the hard work to provide that value. I believe my statement communicates that far more than theirs.


You guys are fucking awful.


Yep, can't disagree with that right now... still just trying to help. Cheers!


True. My take-away: don't overshare motivations for decisions in response to MVP feedback. Instead: respond to criticism with a minimal deflection stating "it's on the roadmap".


https://github.com/webrecorder/warcio - python library.

Tooling is not the best, which is precisely why anybody would bother paying for their service. You can do lots of fancy stuff with WARC and a CDX index. Check out WAIL (https://github.com/N0taN3rd/wail), it works great, but it's slow and resource heavy.

EDIT: You might also want to try out webrecorder self-hosted. Both use the same library, pywb, for archiving.


Both wget and its Python spiritual successor wpull support WARC creation natively.


Do you have an API? Even if WARC isn't something you feel comfortable tackling, an API or webhooks would enable other enterprising developers to add this feature.

Some commenters are knocking PageDash for being a cloud service, and that's understandable. You'll have an easier time on-boarding HN users if you make it easy for them to access and control their data.


This is something some folks have been asking. I think it's still early days, but it's something for a bit later.


For those who had not heard of the WARC format before now (myself included), I believe this is the official specification:

http://archive-access.sourceforge.net/warc/warc_file_format-...

As indicated by the US Library of Congress[0].

0 - https://www.loc.gov/preservation/digital/formats/fdd/fdd0002...


http://fileformats.archiveteam.org/wiki/WARC

Never heard of this until now, it sounds exceptionally pragmatic and good!

I also found these WARC tools made by the folks at Internet Archive, certainly interesting:

https://github.com/internetarchive/warctools


Wait, you can't download archived pages?

ANOTHER ----id ----ing cloud service trying to replace files and programs with some BS pricing scheme?

Seriously are the "entrepreneurs" of HN even trying? How pathetic that seemingly everything on this site is someone's jobs program?


Downloading is definitely on the roadmap. Gotta start somewhere.


Fair enough, my rage is misplaced. It is hard not to get worked up when it seems like everyone is out to replace the old ways with these newer, more disempowering "cloud" alternatives.


Founder here, happy to field questions and feedback!

Right now PageDash is quite a simple product, but hopefully with sufficient traction we can continue to implement things like full-text search, tagging support, link sharing, as well as mobile support. Your support is absolutely crucial to making PageDash come alive even more in the future.

This is my first product, thank you for being nice :)


Would it be possible to add auto-archiving to the extension?

On the $9/mo plan, I'd probably still not hit 100GB/mo of uploads.

My main reason for wanting this feature is that I could use the full text search (when it's available) to search every webpage I've visited. I find myself more and more frequently unable to find things I know existed at one point. I've been thinking of building my own solution where I just archive every page I visit on the fly then build a personal search for pages I've previously visited.


Check this out - seems to do what you want.

https://github.com/lengstrom/falcon

"Chrome extension for flexible full text browsing history search. Every time you visit a website in Chrome, Falcon indexes all the text on the page so that the site can be easily found later."


Hi there, I actually know someone else who wants this feature.

I feel like most of the sites out there are "junk", i.e. things that are not worth saving. So just save the gems. But that's me.

My feelings aside, technically this would be rather hard to do as it stands because saving a page is quite costly: 1) data retrieval and 2) data upload will keep you from using the internet seamlessly, as PageDash will stall things. For some reason, even though the assets are already in the browser, PageDash as an extension cannot retrieve them directly. Also, the extension right now over-mines, i.e. retrieves way more than what is actually needed to render the page, just because there are a lot of dead CSS/JS/image links out there, and I haven't figured out a way to make it more efficient.

However, if all you care about is the HTML (which is essentially what full-text search indexes), then auto-save is technically and realistically possible, but the result won't be pretty.
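
As a rough illustration of the HTML-only route, here is a minimal sketch of pulling searchable text out of a saved page using only Python's standard library. The names `VisibleTextExtractor` and `extract_text` are made up for this example, and it says nothing about how PageDash itself processes pages; it only shows that text for a full-text index can come from the HTML alone, without any of the other assets.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the page's text as one space-separated string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The string returned by `extract_text` is what you would hand to whatever search index you use; the page would not render nicely from the HTML alone, but it would be findable.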


Chrome does support keyboard shortcuts, though. Go to chrome://extensions/, look for "Keyboard shortcuts" at the bottom right, and configure away for a quick save :)


Dig it, got some questions...

Is the data stored exclusively on your Google Cloud or can I see my archived pages while offline and backup my web archive locally?

Essentially, what guarantees do I have regarding access to my data should your startup, my local infrastructure, or civilization collapse?


Right now PageDash is centrally hosted on Google Cloud, although if there is sufficient demand I would consider a locally hosted licensing model with a bring-your-own S3/GCP Storage bucket, or maybe even going open source.

Your question is very valid and right now all I can say is PageDash aims to provide a way for you to retrieve your data either via direct downloads or bucket syncing for the technically inclined. Essentially each page is simply a folder with all the assets stored in a flat structure, so there is no fancy tech here. Once the page is processed and archived, it's a done deal. My personal guarantee is that as long as someone is paying and PageDash is not going hugely red I will keep the service up. The free quota is meant as a trial and you can help keep PageDash up by subscribing :)

Cloud brings other benefits such as the ability to retrieve from anywhere, as well as use multiple clients. But yes, I can understand the concern, it is a valid one.


Love what you guys did with PageDash. Archiving locally is perhaps the main reason why I'm paying Evernote and not PageDash, however. Still, I've never been happy with the way Evernote tries to be a filing system and I like the simplicity of PageDash. That said, if I'm going to invest time to archive I would like to know that the effort won't be wasted if you go out of business.


I'm sorry, I don't think your product, as it is, provides enough value for the cost.

I would suggest using the WARC format someone else mentioned as well as providing a mechanism for backing up locally and a guarantee that all data will be available for X weeks/months/years following any buyout or failure of the business.

Good luck!


Why must everything be a cloud service? I use ScrapBook (http://www.xuldev.org/scrapbook/) to save web pages locally.


Because without a cloud service you can't get people to pay $3-$9/month for something like this.


I really like the idea of this and other, similar products/services. I haven't used any of them since they don't seem to be exactly what I want.

What I really want is auto tagging and classification + semantic search. I don't even really want to have to save the page. I want this functionality on my browsing history.

Maybe some increased functionality for saving specific types of pages. If I save a recipe, I want the service to recognize that it's a recipe and put it in my 'cookbook'. With a consistent format, if possible. If I save a blog post, tag the topic, technology and language used.


I really want auto-tagging via ML classification as well; it's one of the things I wanted, other than one-click save, when I started the project. That's really a nice-to-have at the moment and can only be achieved once PageDash matures more. Right now the closest workaround I can offer for your use case is to configure a keyboard shortcut for saving with the extension via chrome://extensions > Keyboard shortcuts (bottom).


Yeah, I realized after I posted that this is a good first step to that goal: a way to gather a bunch of user search and tagging data. Thanks for sharing.


The previous web archive launched on HN is already dead [0]. Many of the comments from that discussion also apply here. Good luck and I hope you'll manage to stay online!

[0]: https://news.ycombinator.com/item?id=14644441


Open-source, self-hosted alternative(s) discussed within days of the above:

Wallabag: a self-hostable application for saving web pages | https://news.ycombinator.com/item?id=14686882 (Jul 2017: 166 points, 53 comments)


What are the advantages of this over something like pinboard.in?


Unless you are on Pinboard's archiving plan, Pinboard mostly manages just your bookmarks. PageDash doesn't claim to be a bookmark manager, but it really can be one. Bookmark the page, along with the content.


Signed up - saved my first page - and viewed my dash within 5 mins. Good stuff. Now, all you need is not to go out of business (or open source before you do). Seriously though, good luck on the business side.


Thank you. You're right, hopefully the business side holds up. I'll keep it up as long as someone is paying me :)


I should point out that PageDash also tries to handle saving nested pages and iframes; I'm not sure that's something other archivers try to do.

Also Web Components (custom elements, shadow DOM) support is definitely do-able and something for the pipeline. It's not something even the Internet Archive is capable of right now. Wayback Machine's youtube.com archive is blank.


Looks interesting. Why would I use PageDash over something like Evernote or Pocket?


Good question. PageDash aims to preserve the page in the original format and render it just as you saw it. Right now, Evernote does quite a bad job at rendering; I've used it a lot. Pocket, on the other hand, specializes in stripping out the HTML and leaving just the content in a reader-mode fashion, though I've not tried their premium offering that also archives.

PageDash archives from the front-end, while many archivers tend to archive by sending a link to the backend which then queries the website remotely, so you might not be archiving what you saw exactly, which admittedly in many cases doesn't matter. The upside of this technicality is that you can save content that you see only when you are logged in!


Sounds like PageDash is not really for me then, but I can definitely see why some others might want to use it. Best of luck!


Hi Michael, what is your use case? I do plan to include a reader mode in PageDash to view pages in a clean layout. It is possible because PageDash has all the raw data available for each page.


Anything FOSS in this sphere? I'm slowly going towards building my solution to automatically archive my Firefox bookmarks locally, but a bit too slowly.


There's plenty.

Here's a pretty comprehensive list that someone else made: https://news.ycombinator.com/item?id=14647119

Here's another FOSS project that I found: https://github.com/pirate/bookmark-archiver


Have you considered saving files (such as fonts and JS libs) loaded through major CDNs centrally just once instead of storing them again each time a page is saved?

Maybe you already have plans for this, but it would be smart to implement a system that checks whether files are already present on your server so you don't waste any of your users' quota and the server's disk space.
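
A minimal sketch of what I mean, assuming assets are keyed by a hash of their bytes (content-addressed storage) rather than by URL. The class `DedupAssetStore` and its methods are invented for this example and use an in-memory dict in place of a real bucket; it is not a description of how PageDash stores anything.

```python
import hashlib


class DedupAssetStore:
    """In-memory stand-in for a content-addressed asset store (hypothetical)."""

    def __init__(self):
        self._blobs = {}        # sha256 hex digest -> asset bytes, stored once
        self._page_assets = {}  # page id -> {asset url: digest}

    def save_asset(self, page_id: str, url: str, content: bytes) -> bool:
        """Record an asset for a page; return True only if new bytes were stored."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self._blobs
        if is_new:
            self._blobs[digest] = content
        # Every page keeps its own url -> digest mapping, so each page still
        # "owns" a complete list of its assets even when the bytes are shared.
        self._page_assets.setdefault(page_id, {})[url] = digest
        return is_new

    def load_asset(self, page_id: str, url: str) -> bytes:
        """Fetch the bytes a given page saved for a given asset URL."""
        return self._blobs[self._page_assets[page_id][url]]

    def stored_bytes(self) -> int:
        """Total bytes actually held, after de-duplication."""
        return sum(len(blob) for blob in self._blobs.values())
```

Hashing the content instead of trusting the URL means a CDN file that changes under the same URL is stored as a new blob, so older saved pages keep the version they were captured with.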


Thanks for the comment! That would be ideal and it has crossed my mind, but I have given little thought to how to do de-duplication right (premature optimization from a maker's perspective). Right now each page and its assets sit within its own "bucket". But yes, page assets and all these dependencies can really add up fast.


Excellent work. I can now close all those browser tabs I've had open in the background for weeks, just so I don't lose the page.


Thank you! Would really love to hear your feedback on the product, warts and all. jonathan[[at]]pagedash.com


Wouldn't the OneTab extension be a better solution? I see PageDash as a personal Internet Archive/Wayback Machine.


Possibly yes for the problem as I described it. But really what I want is to mark pages that I thought were interesting and _might_ want to see again at some point in my life. Maybe that will be in two years. My approach of leaving a load of tabs open for a few weeks catches many of the cases of pages I want to re-read, but is obviously not feasible for years.

And really I want to be able to search my history with queries like (contains foo, not bar, was marked sometime in 2014 or 2015). And see my history in chronological order. The problem with my browser's history is that it is polluted with the pages I never want to see again, like last week's weather forecast.
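
Spelled out as code, the kind of query I have in mind looks roughly like this. It is only a sketch over an in-memory list; `SavedPage` and `search` are names I made up, and a real service would presumably run this against a proper search index instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SavedPage:
    url: str
    text: str
    marked_on: date


def search(pages, contains, excludes, years):
    """Return pages whose text contains `contains`, does not contain `excludes`,
    and were marked in one of `years`, in chronological order."""
    hits = [
        page
        for page in pages
        if contains.lower() in page.text.lower()
        and excludes.lower() not in page.text.lower()
        and page.marked_on.year in years
    ]
    return sorted(hits, key=lambda page: page.marked_on)
```

So "contains foo, not bar, was marked sometime in 2014 or 2015" becomes `search(pages, "foo", "bar", {2014, 2015})`.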

I'm sure that there's something that does this already, maybe it is Evernote. But PageDash's guarantee of preserving the historical state seems novel and worthwhile.


I'm sure Evernote could do this.

I've been using two different kinds of apps for this:

- Offline-reading apps for stuff I want to read
- (Tech) support ticket apps for stuff I (might) want to do

I'm currently using Pocket for the first kind. Before that I was using Instapaper.

For the second kind I'm using GitLab but I've still got a lot of old content in FogBugz that's not yet migrated. Neither saves or indexes the contents of links tho so I manually include text about what it is I wanted to do or sometimes just specific text I'd expect future-me to use to find the stuff.

The main reason I use GitLab for stuff related to something I might want to do is that I use it for all of my other 'project' content too so it's nice to have everything of that nature in one place.


I just tried Evernote. I had a bunch of tabs open in Chrome. I installed both the PageDash and Evernote extensions, then clicked their respective toolbar buttons. Here's what ensued.

PageDash: The page was saved. I went to app.pagedash.com and could see it at the top of my history.

Evernote: The extension popped up a dialog asking whether I wanted to clip an Article, Small Article, or Bookmark. I didn't know the difference so just clicked Save. I got a dialog telling me I needed to reload the page. So I did and clicked the button again. That worked but took much longer than PageDash, then showed another dialog saying "No Result" in large print and "Clipped to First Notebook" in small print. Ugghh. I visited the Evernote website and got a massive popup asking me if my email was still the one that I'd registered four minutes earlier. I clicked through and was redirected to a massively long URL. I could see my clipped article there, but it's unclear if this massive URL is the one I should bookmark.

It looks like PageDash is simpler and faster than Evernote. I like simple and fast.

Unfortunately it looks like I'll run out of my monthly storage allowance on PageDash very quickly. And I can't see any search function in PageDash.


Thanks for the feedback. Search is high on my priority list next (already started looking into it)! And I really meant the free quota to be more of a try-PageDash thing, so I'd really appreciate the support via a paid subscription, because ultimately that would keep me going. Please do let me know via support@pagedash if you feel the pricing is a bit steep, I can open up an even lighter plan. But I know that not being able to search is a deal-breaker, so stay tuned for that.


After signing in, my initial reaction is that I wish I didn't have to use a browser extension to save a page.

It would be handy if I could just enter a URL and have it saved, a la Pinboard or Instapaper.

That said, this worked very well on my first try.


Thanks for the comment! Maybe I will make that possible in the future, but for now the advantage of this is that you can save logged-in content, i.e. content that you see when you're logged in. Passing the URL to the backend prevents that, as the backend is not authenticated or, even worse, is blocked.


Alright folks, it's 3am where I am at the moment, gonna hit the sack. I'll address more questions and concerns tomorrow. Thank you for all your feedback!


So this works until your cloud site dies. No thanks.


There are a few ways I can go about mitigating this.

1) Provide PageDash with API access to your S3/GCP bucket so that it syncs your pages out to your bucket.

2) Provide an open-source viewer to view files saved within your bucket. It's just like serving a website, really, no more processing needed.
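
To give a sense of how little such a viewer might need, here is a sketch using only Python's standard library. It assumes a synced page is a local folder of static files, and `serve_archive` is a hypothetical name; the actual viewer, if it happens, may look nothing like this.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_archive(folder: str, port: int = 0) -> ThreadingHTTPServer:
    """Serve one archived page folder on localhost in a background thread.

    port=0 lets the OS pick a free port; read it from server.server_address.
    """
    # Bind the handler to the archive folder instead of the current directory.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=folder)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call `server.shutdown()` and `server.server_close()` when you are done viewing.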


This will be great for archiving the past :)


Which media files are stored? I assume images yes, videos no?


Images yes, but videos not yet.


Very interesting idea


Awesome stuff!



