
Show HN: Transform any web page into a document - apancyborg
https://documentcyborg.com/
======
mixedbit
Nice UI and nicely formatted output.

Why does it return a .zip that needs to be unpacked? To save bandwidth you
could just use the gzip 'Content-Encoding' and return the format requested by
the user, which would be unpacked by the browser.

The returned file name is [documentcyborg.com].zip; it would be nicer if the
domain of the requested document were used instead.
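A minimal sketch of what this could look like on the server side, assuming a Python backend (the function, filename, and content type here are illustrative, not the site's actual code):

```python
import gzip

def make_response(document_bytes, filename):
    """Compress the document and build the headers a server would send.

    With Content-Encoding: gzip, the browser transparently decompresses
    the body, so the user receives the requested format directly instead
    of a .zip archive, while bandwidth and storage stay compressed.
    """
    body = gzip.compress(document_bytes)
    headers = {
        "Content-Encoding": "gzip",
        "Content-Type": "application/msword",
        "Content-Disposition": f'attachment; filename="{filename}"',
    }
    return headers, body

headers, body = make_response(b"example document", "example.com.doc")
# gzip.decompress(body) recovers the original bytes, which is exactly
# what the browser does before handing the file to the user.
```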

~~~
apancyborg
The zip file is not just to save bandwidth, it is also to save disk space:
we don't know yet how many users will use it, so we preferred to be cautious.
But you are right, in the future we can serve the file directly once we know
how much disk space will be needed to sustain it. As for the returned document
name, that's up for discussion; if a lot of users ask for it, we will change
it. Thanks for using the app and for your comment.

~~~
martin-adams
As a user, if I ask for a DOC I want a DOC. If your service gives me a zip,
it's not what I asked for. As a programmer, this is quite simple to solve, as
others have explained (store gzipped and serve with the correct headers).

~~~
apancyborg
Understood, we will change that.

------
koolba
That hostname is very difficult to read. I kept squinting and thought it was
"Documentcy Borg". If you're going to put together something like this, you
need a shorter, catchier name.

Here's a freebie name that's (as of writing this) unregistered: page2doc.com

~~~
apancyborg
Unfortunately our boss The Cyborg doesn't want us to change the name. As his
minions we can't do much but if we escape we promise to free Document from the
Cyborg:)

------
vjandrea
1) The zipped download always has the same name, which quickly becomes
confusing if you're grabbing several pages. Maybe adding a timestamp or,
better, a title snippet could help.

2) Found HTML tags and incorrect whitespace when exporting this page to TXT:
[https://www.packtpub.com/packt/offers/free-learning](https://www.packtpub.com/packt/offers/free-learning)

I want to try this tool with pages that Instapaper fails to grab.

~~~
apancyborg
1) Understood. For the moment we have not yet decided which way to go, but
for sure we will stop returning only a [documentcyborg.com].zip file. 2) We
found some bugs in the RTF and TXT export (they're getting fixed); the other
exports are working correctly.

For the Instapaper pages that fail: if you could send us the domain names you
wish to hn at documentcyborg.com, we will test the parser against them to make
sure it works.

~~~
dredmorbius
Food for thought:

1\. Externally-provided content is dangerous. You might use a _hash_ of the
domain name, but I'd avoid files named after the sources.

2\. Metadata such as a date would be useful.

3\. Despite 1, a highly-sanitised hostname could be informative. An iconv to
8-bit ASCII [-a-zA-Z0-9_], and _not_ allowing the first character to be '-'
might be a start. Put a length limit on that as well.
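The sanitisation described above might look like this (a minimal Python sketch; the function name, length limit, and hash fallback are my assumptions, not part of the original suggestion):

```python
import hashlib
import re

def safe_filename(hostname, max_len=40):
    """Reduce an externally supplied hostname to a safe filename stem.

    Keep only [-a-zA-Z0-9_], forbid a leading '-', cap the length, and
    fall back to a short hash when nothing usable remains.
    """
    cleaned = re.sub(r"[^-a-zA-Z0-9_]", "", hostname)
    cleaned = cleaned.lstrip("-")[:max_len]
    if not cleaned:
        # Nothing survived sanitisation: name the file after a hash
        # of the original input instead, as suggested in point 1.
        cleaned = hashlib.sha256(hostname.encode()).hexdigest()[:12]
    return cleaned

# "documentcyborg.com" loses its dot and becomes "documentcyborgcom";
# hostile input like "--évil/..name" is stripped to plain ASCII with
# no leading dash.
```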

~~~
apancyborg
Thanks. For 1), the filename is based on the extraction, not the source, so
this is very easy to mitigate. 2) Right, we can include the date in the
filename. 3) We have included most of this in the new updated version now
online. Thank you very much for your report.

------
waldfee
When I tried it with itself I got "We couldn't find any text for creating the
document. Please send us the problematic link :
[https://documentcyborg.com/](https://documentcyborg.com/) via our contactus
form."

Otherwise nice tool, thx

~~~
apancyborg
We will make our parser work on our own website. We understand the point.

------
zepolen
I can't put down a person for effort, but this tool doesn't work at all. Over
half the pages I tried are missing content entirely.

~~~
apancyborg
If you can, please send us the domain names that failed to: hn at
documentcyborg.com. We will fix all the problems you encountered on these
domains.

~~~
fareesh
I casually tried it on a reddit post but all I got was one line:

[https://www.reddit.com/r/DotA2/comments/50neqc/will_future_u...](https://www.reddit.com/r/DotA2/comments/50neqc/will_future_updates_decrease_fps_even_further/)

------
themapper
Nice UI, but it doesn't work in my case. I've tried a bunch of URLs from
Stack Exchange and the output is not readable (not properly formatted). But
anyway, what I suggest is to add a feature that I'm looking for (and maybe
not only me?): converting a whole website into some kind of book, for example
a PDF. As I remember, Adobe Acrobat has this functionality included, but I'm
not aware of any web app that can do it. So basically you could parse a whole
website and create a book out of it. Anyway, good luck!

~~~
apancyborg
We tried several Stack Exchange URLs and they did work, for example:
[http://codegolf.stackexchange.com/questions/92138/the-letter...](http://codegolf.stackexchange.com/questions/92138/the-letter-e-with-e)
which has a complicated layout. Generate an EPUB or anything else and you
will see that the quality is good. We agree that the PDF renderer is not as
good as the others yet, but we are working on it. If you could send us the
link that didn't work, either to hn at documentcyborg.com or via a comment
here on HN, so we can get better, that would be great. For the book idea, the
problem is when to stop: if you make a book out of Amazon it will take years.
Maybe the possibility to add several URLs and generate a "book" document from
them. Thanks for trying the app; we just wish it was up to your standard, but
we have been constantly improving it since we launched 3 days ago, so
hopefully we will get there soon.

~~~
themapper
For example I've tried this one:
[http://gis.stackexchange.com/questions/102555/automatically-...](http://gis.stackexchange.com/questions/102555/automatically-selecting-feature-when-another-feature-in-different-layer-shares-a).
Only the question text was saved into the document. As for the book from
URLs, there could be some options for how deep to scan: only the current
page, all subpages, etc.

~~~
apancyborg
OK understood, for the moment we get only the main content, we are working on
a new "parser"that can fetch the main content and comment attached to it. For
this you are right we don't extract content but we expect to have this
released soon. For the book you are right, we will think about it and how to
integrate it. Thanks for the feedback, this is greatly appreciated.

------
clusmore
Interestingly, the webpage itself cannot be turned into a document.

>We couldn't find any text for creating the document. Please send us the
problematic link : [https://documentcyborg.com/](https://documentcyborg.com/)
via our contactus form.

This was the first page I thought to try it on, so you might want to consider
adding more text to your landing page so that it will work.

~~~
econnors
I did the same thing, this is definitely worth looking into.

~~~
apancyborg
:) Ok we will look into it, I understand the point.

------
aq3cn
I tried this wikiHow link and it extracts no more than a paragraph from the
webpage in PDF or Word:
[http://www.wikihow.com/Use-Physical-Therapy-to-Relieve-Back-...](http://www.wikihow.com/Use-Physical-Therapy-to-Relieve-Back-Pain)

I don't understand what the advantage of server-side processing is. I have
always depended on Evernote Clearly and the Readability mobilizer to turn any
web page into nice text which can be copied and pasted into MS Word. Then I
can use Prince ([http://www.princexml.com/](http://www.princexml.com/)) to
batch-convert docx into whatever I want. If someone finds it slow, they can
enable auto clipboard capture: press ctrl+` to enable Clearly, ctrl+A to
select all text, ctrl+C to send it to the clipboard.

I wish someone would work on this to make it smoother for desktop users.

PS: I also make use of Firefox's reading view mode.

~~~
apancyborg
All fixed.

------
coldshower
I use the DuckDuckGo bang !pf to convert webpages into PDFs (got the tip from
here: [http://duckgobang.com](http://duckgobang.com)) but your tool converts
to multiple file formats which is even better.

~~~
apancyborg
We don't just convert to other formats; we try to extract only the meaningful
content, not the whole page. Thank you very much for your positive comment.

------
rsync
Hmmm...

In some ways, Oh By[1] performs the exact opposite role. Which is to say, Oh
By allows you to transform any document into a web page.

Well, any document 4096 characters or shorter ...

[1] [https://0x.co](https://0x.co)

------
webtechgal
Great!! Works like a charm... :-)

While you're at it, why not throw in the conversion to a few (popular) image
formats as well? I can think of (at least) some scenarios where that would
come in quite handy (e.g. posting long articles to Twitter).

All the best moving forward.

~~~
apancyborg
Nice to hear, we are thinking about adding image export and also markdown
export. We are awaiting more feedback regarding those two and based on that we
will move forward. Thanks for your comment.

------
jmnicolas
I tried it on this webpage:
[https://www.rt.com/news/357833-france-coca-cola-cocaine/](https://www.rt.com/news/357833-france-coca-cola-cocaine/)

Not bad but the title of the article is missing.

The tweet is missing too but I can't decide if it's a good or a bad thing.

There should be options to remove pictures too I think.

Honestly I'd be interested by a standalone product like this. I don't like the
fact that you know everything I store.

~~~
apancyborg
The title of the article is the name of the document; we understand that this
is misleading, so we can add it at the top of the document. For the tweet,
this is up for discussion: for the moment our parser removes media cards; we
could add an option to enable or disable them. For the pictures, this is an
improvement we could make easily. Actually, we do not know what documents you
generate, but I understand the need for end-to-end privacy in such a product.
Thanks for your comment and for using the app.

~~~
jmnicolas
Sorry I forgot to mention : I generated a PDF.

Later I tried a text export of the same page, some HTML remains (</div>
elements).

Using the title of the page to name the zip would be nice too.

~~~
apancyborg
We found some bugs in the RTF and TXT export; we are fixing them right now.
We are going to change the way we serve the file, either as user mixedbit
suggested or another way that works for sure across browsers, through
proxies, and so on. Thanks for the report.

~~~
apancyborg
All fixed.

------
ksk
This is a useful service. However, I picked the first link on medium.com and
the PDF conversion didn't go that well. The layout of some of the text gets
messed up.

[https://medium.com/@subes01/this-is-your-life-in-silicon-val...](https://medium.com/@subes01/this-is-your-life-in-silicon-valley-933091235095#.6trg4krj4)

~~~
apancyborg
The layout for DOC, ODT, EPUB, RTF, and TXT is as good as the Medium article.
For the PDF you are right; we are fixing our PDF renderer right now so that
elements don't overflow. We will keep you updated once done.

~~~
apancyborg
On some EPUB readers we also have an overflow; fixing it now, will let you
know once fixed.

~~~
apancyborg
Both issues are now fixed; the EPUB and PDF renderers are now working
correctly.

------
lousken
Tried it on
[http://www.washingtonpost.com/sf/business/wp/2016/08/13/2016...](http://www.washingtonpost.com/sf/business/wp/2016/08/13/2016/08/13/tim-cook-the-interview-running-apple-is-sort-of-a-lonely-job/)
and exported to PDF; not what I was hoping for.

~~~
apancyborg
You are right, we're working on a fix now.

------
anotheryou
Maybe you mean "turn any web article into a document"?

It seems to search for the largest body of text and omit the rest.

~~~
apancyborg
You are right. We need to change the headline to reflect this, even if the end
goal is to turn any web page into a doc.

~~~
apancyborg
Fixed. The new title is: Transform any web article into a document

------
Veen
I tried it with
[http://plato.stanford.edu/entries/sartre/](http://plato.stanford.edu/entries/sartre/)
and got the message: "You are now a certified Document Cyborg hacker, but you
don't get any files:)"

~~~
apancyborg
Please, can you tell us which browser you are using? This should happen only
if you disabled JavaScript or tried to change the export type parameter. If
neither is your case, we will throw one of our programmers to the tiger :)

~~~
Veen
Alright, I'm dumb. It was my JavaScript blocker. It worked once I disabled it.

Nice service. Only advice I can give is to tweak the typography a bit for the
PDF output: I'd restrict the measure (line lengths) to around sixty
characters, and boost the leading (space between lines) to about 1.5 times the
line-height. Personally, I'd make the body text a bit smaller too, but the
bigger text might be preferred by some.

~~~
apancyborg
Great to hear. We will communicate the error better and ask people to enable
JavaScript. For the PDF output, we will look into it; it is a complex issue
if you have, for example, content mixing non-Latin and Latin characters.
Thanks for the feedback.

------
raihansaputra
Great tool, but for reddit threads with a lot of comments it only fetches the
first one. I know it's quite hard with the child comments and the number of
them, but I really wish there were a way to take them offline for me to read.

------
anotheryou
Feature suggestion: single-page PDFs that look like the page in the browser
(basically like the result of a FireShot screenshot in a PDF, but keeping the
text as text, not an image).

Great for offlineish portfolios.

~~~
seszett
[https://fprin.tf](https://fprin.tf) does that, but... "this service is still
in beta and probably not ready for production use yet" and it doesn't seem to
target the same users as
[https://documentcyborg.com](https://documentcyborg.com)

------
anonbanker
Bookmarked. I wish I'd had this back in May, while preparing exhibits for
affidavits I was filing. I'll definitely be making use of this!

~~~
apancyborg
Thanks

------
dredmorbius
Is there anything this does (I've not looked at it) that a 'curl | pandoc'
pipe wouldn't?

~~~
apancyborg
Just try it on Wikipedia and you will see the difference.

~~~
dredmorbius
Hrm. Font's small and linewidths too wide.

Interesting idea though.

------
giis
Nice UI and worked fine! Just yesterday, I wanted to download some text from
specific site, this is perfect!

~~~
apancyborg
Thanks

------
zackify
Ironic: they have a Twitter icon in the top right, yet getting a PDF of a
Twitter account doesn't even work.

~~~
apancyborg
You are right; we still have problems when the content has little text to
extract. Working on it.

------
aswanson
This is dope. Thanks.

~~~
apancyborg
Thanks

