

Show HN: Download all threads of an Orkut community - anuragbiyani
https://github.com/abiyani/orkut-community-downloader

======
fiatjaf
Instead of trying to fix the bugs, maybe you should look at what I think would
be better tools for accomplishing the job:

* PhantomJS[1] (with or without the facilitator CasperJS[2]) -- in fact I'm kinda motivated to write a script myself using CasperJS to save my communities, now that you reminded me they are there waiting to be saved;

* Selenium[3] (I have never looked at it, but it seems interesting, as its slogan is "Browser Automation").

[1]: [http://phantomjs.org/](http://phantomjs.org/)

[2]: [http://casperjs.org/](http://casperjs.org/)

[3]: [http://www.seleniumhq.org/](http://www.seleniumhq.org/)

~~~
anuragbiyani
Thanks for sharing these links! I did take a brief look at Selenium, but due
to my lack of familiarity with it and no mention of how to save the full page
as HTML (instead of a screenshot), I quickly moved on to writing this version
using xdotool. I agree this approach is clunkier and perhaps less likely to
work across different platforms (than, say, Selenium/CasperJS), but in the end
it served the purpose for me (Ubuntu 12.04, GNOME, Chrome) by letting me
download a bunch of communities. :)

Good luck with your CasperJS approach, and I would love to learn from it if
you wish to share the code later.

------
_delirium
This reminds me indirectly that Google's Data Liberation blog hasn't had a
post in over a year:
[http://dataliberation.blogspot.com/](http://dataliberation.blogspot.com/)

Seems like if they were still active, easing this kind of export would be
within their stated goals?

------
fiatjaf
Are you AUTOMATING MOUSE CLICKS? I really like it, but I'm seeing some bugs
here I can't even describe.

~~~
anuragbiyani
To save a single page - yes (to be precise, it simulates key presses and does
not rely on any mouse action). The page-saving part is actually handled by
another script I wrote a few days back
([https://github.com/abiyani/automate-save-page-as](https://github.com/abiyani/automate-save-page-as)),
and it was the only option I could find to automate saving a page (as rendered
by Chrome). I would be very happy to learn of any alternative for this task.

This particular script (orkut-community-downloader) uses the
"automate-save-page-as" script to download every single page, and therefore
contains the (simple) logic for navigating through all the posts and polls,
later changing the links to point to the local copy, symlinking files to save
space, and providing support for "resuming" the download operation.
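
To give an idea of the link-changing step, here is a minimal Python sketch. It is not the code from the repository; the function name `localize_links`, the `url_to_local` mapping, and the example.com URLs are all made up for illustration, and it only handles double-quoted `href` attributes.

```python
import re

# Matches double-quoted href attributes only (a simplifying assumption).
HREF_RE = re.compile(r'href="([^"]+)"')


def localize_links(html, url_to_local):
    """Point links at local copies when one is known, else leave them alone.

    url_to_local is a hypothetical dict: remote URL -> local file path.
    """
    def swap(match):
        local = url_to_local.get(match.group(1))
        # Keep the original attribute when the page was not downloaded.
        return 'href="{}"'.format(local) if local else match.group(0)

    return HREF_RE.sub(swap, html)


if __name__ == "__main__":
    page = '<a href="http://example.com/thread/2">t</a>'
    print(localize_links(page, {"http://example.com/thread/2": "thread-0002.html"}))
```

Leaving unknown links untouched means a partially downloaded community still has working (remote) links for whatever is missing.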

Do you mind sharing what bugs you are facing (and which platform you are
testing on), so I can try to fix them? If HN comments don't seem a fitting
place, feel free to email me (ID: <username>@gmail.com).

~~~
fiatjaf
Well, I'm using Linux Mint with Cinnamon, and I saw a "Save page as" dialog
after the Orkut community listing page was opened; then the script tried to
write the path in the "Name" field of the dialog. It failed to erase the
default name, then failed to write the slashes, resulting in a name like
`homefiatjaforkut-community-downloaderPrevious-name-of-thepage.html` and an
error that (luckily) caused the program to stop.

~~~
anuragbiyani
Can you test the following keystrokes manually and let me know the outcome
with your browser:

1) Open any page in Chrome

2) Press Ctrl+S

3) Press "Right" arrow key exactly once

4) Type some string, say "-suffix"

5) Press "Home" key exactly once

6) Type in some directory name with "/" at the end, say: "/tmp/"

7) Press Enter.
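
For reference, the steps above map roughly onto xdotool's `key` and `type` commands. The Python sketch below is only an illustration, not the actual script: the function names and the `dry_run` switch are made up, and by default it just prints the commands instead of sending keystrokes.

```python
import shutil
import subprocess


def build_save_as_commands(suffix="-suffix", dest_dir="/tmp/"):
    """Return the xdotool invocations for steps 2-7, in order."""
    # "--" is the usual end-of-options marker, used here so that text starting
    # with "-" is not read as a flag (assumed to be honoured by xdotool).
    return [
        ["xdotool", "key", "ctrl+s"],          # step 2: open the "Save as" dialog
        ["xdotool", "key", "Right"],           # step 3: press Right once
        ["xdotool", "type", "--", suffix],     # step 4: type the suffix
        ["xdotool", "key", "Home"],            # step 5: jump to the start of the field
        ["xdotool", "type", "--", dest_dir],   # step 6: type the directory prefix
        ["xdotool", "key", "Return"],          # step 7: confirm the dialog
    ]


def run_commands(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False and xdotool exists."""
    for cmd in commands:
        if dry_run or shutil.which("xdotool") is None:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_commands(build_save_as_commands())
```

Running it with `dry_run=False` while a Chrome window has focus would send the real keystrokes, so the dry run is the safer way to compare against what you see when typing by hand.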

Sorry if this sounds pedantic, but this is the actual sequence of keystrokes
the script will mimic in this case, so I just want to make sure that it works
in the first place. Alternatively, can you try the
[https://github.com/abiyani/automate-save-page-as](https://github.com/abiyani/automate-save-page-as)
script in isolation (with the --suffix command line flag)?

PS: Under usual circumstances, the name field should be replaced by
something like this by the script (before pressing Enter in the "Save as"
dialog box):
"/fully/expanded/dest/directory/path/<original_page_title>-000X.html"

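As a rough illustration of how such a name could be assembled, here is a small Python sketch. It assumes the "000X" part is a zero-padded counter, which is a guess from the pattern above; the function name and its parameters are hypothetical and not taken from the script.

```python
import os


def build_save_name(dest_dir, page_title, index, width=4):
    """Build '<abs dest dir>/<page title>-<zero-padded index>.html'."""
    # Expand "~" and relative paths so the dialog receives a full path.
    full_dir = os.path.abspath(os.path.expanduser(dest_dir))
    return os.path.join(full_dir, "{}-{:0{}d}.html".format(page_title, index, width))


if __name__ == "__main__":
    print(build_save_name("/tmp/orkut", "My Community", 7))
```

If the slashes are lost while typing, a name built this way degrades into exactly the kind of run-together string reported above, so checking that the saved file exists at the expected path is a cheap sanity test.
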
~~~
fiatjaf
It works almost always :P

(I mean the sequence of commands. It's kinda nice, now that you explained it.
I'll try running your script from another computer.)

